module5DataEngineering

The document provides an overview of data engineering, covering key concepts such as data ingestion techniques, data storage management, and the roles of data engineers, data scientists, and data analysts. It discusses various data ingestion methods, including batch and real-time processing, and highlights best practices for managing data quality and scalability. Additionally, it contrasts data lakes and data warehouses, emphasizing their distinct functionalities and use cases.

Uploaded by

Andita Dwiyoga


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/386573107

module 5 Data Engineering

Chapter · December 2024


1 author:

Khwaish Shahani
Vivekanand Education Society's Institute of Technology

All content following this page was uploaded by Khwaish Shahani on 09 December 2024.


Module 5: Data Engineering

Introduction to Data Engineering, Data Ingestion: Techniques and Best Practices, Data Storage and Management: Data Lakes, Data Warehouses, Data Processing Pipelines.
5.2 Lambda Architecture, Batch Processing, Stream Processing, Data Quality and Governance.

Khwaish
[Q1] Write a Short Note on Data Engineering
[1] Data engineering is a discipline focused on designing, building, and managing the infrastructure required to collect, store, and analyze large volumes of data.
[2] It enables organizations to transform raw data into useful insights and is the backbone of data science, machine learning, and business intelligence.
[3] Data engineering is a set of operations that makes data available and usable to data scientists, data analysts, business intelligence (BI) developers, and other specialists within an organization.
[4] It takes dedicated experts, namely data engineers, to design and build systems for gathering and storing data at scale and for preparing it for further analysis.

Data Engineer
• [1] A data engineer is a professional who prepares and manages big data that is then analyzed by data analysts and data scientists.
• [2] They are responsible for designing, building, integrating, and maintaining data from several sources, thereby shaping the infrastructure for the data collected in the database.

Data Scientist
• A data scientist uses machine learning, deep learning, and inferential modeling to find correlations in data and to build predictive models, on the basis of which they can develop recommendation systems useful to the business.

Data Analyst
• A data analyst is responsible for:
  • screening and cleaning/polishing the raw data collected;
  • data preparation;
  • understanding business metrics and problems;
  • visualizing data through reports and graphs;
  • identifying trends and making useful suggestions to aid strategic business decisions.
[5] Data Scientist vs. Data Engineer vs. Data Analyst

Fig. 1: Data Scientist vs. Data Engineer vs. Data Analyst

Roles and Responsibilities

Data Analyst | Data Engineer | Data Scientist
-------------|---------------|---------------
Preprocessing and data gathering | Develops, tests, and maintains architectures | Responsible for developing operational models
Emphasis on representing data via reporting and visualization | Understands programming and its complexity | Carries out data analytics and optimization using deep learning and machine learning
Responsible for statistical analysis and data interpretation | Deploys ML and statistical models | Involved in strategic planning for data analytics
Ensures data acquisition and maintenance | Builds pipelines for various ETL operations | Integrates data and performs ad hoc analysis
Optimizes statistical efficiency and quality | Ensures data accuracy and flexibility | Fills the gap between customers and stakeholders
[Q2] Write a Short Note on the Data Engineering Process

• The data engineering process covers a sequence of tasks that turn a large amount of raw data into a practical product meeting the needs of analysts, data scientists, machine learning engineers, and others.
• Data ingestion (acquisition) moves data from multiple sources (SQL and NoSQL databases, IoT devices, websites, streaming services, etc.) to a target system, where it is transformed for further analysis.
• Data comes in various forms and can be either structured or unstructured.
• Data transformation adjusts disparate data to the needs of end users. It involves removing errors and duplicates from the data, normalizing it, and converting it into the required format.
• Data serving delivers the transformed data to end users: a BI platform, a dashboard, or a data science team.

Fig 2 : Data Engineering Process
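The three stages above (ingestion, transformation, serving) can be illustrated with a minimal Python sketch. The record fields (`id`, `city`, `amount`) and the in-memory `RAW_RECORDS` list are hypothetical stand-ins for real sources; an actual pipeline would read from databases, APIs, or streams.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from several sources.
RAW_RECORDS = [
    {"id": "1", "city": " Mumbai ", "amount": "120.50"},
    {"id": "2", "city": "pune", "amount": "80"},
    {"id": "2", "city": "pune", "amount": "80"},      # duplicate record
    {"id": "3", "city": "Mumbai", "amount": "oops"},  # invalid amount
]

def ingest():
    """Ingestion: move records from the sources into one place."""
    return list(RAW_RECORDS)

def transform(records):
    """Transformation: remove errors and duplicates, normalize, convert types."""
    seen, clean = set(), []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # drop records whose amount cannot be parsed
        if rec["id"] in seen:
            continue  # drop duplicates by id
        seen.add(rec["id"])
        clean.append({
            "id": int(rec["id"]),
            "city": rec["city"].strip().title(),  # normalize spelling
            "amount": amount,
        })
    return clean

def serve(records):
    """Serving: aggregate into the shape a dashboard might consume."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["city"]] += rec["amount"]
    return dict(totals)

report = serve(transform(ingest()))
print(report)  # {'Mumbai': 120.5, 'Pune': 80.0}
```

Keeping each stage as a separate function mirrors the process description: each stage can be tested or replaced independently.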

[Q3] Write a Short Note on Data Ingestion Techniques

1) Batch Data Ingestion:
• Involves collecting large amounts of raw data from various sources into one place and processing it later.
• Data is collected and processed at intervals (e.g., hourly, daily).
• Use Cases: Suitable for historical data processing, reporting, and use cases where real-time analysis is not required.
• Tools: Apache Sqoop, AWS Glue, Google Cloud Dataflow, Talend.
2) Real-Time (Stream) Data Ingestion:
• Involves streaming data into a data warehouse in real time, often using cloud-based systems that can ingest the data quickly, store it in the cloud, and then release it to users almost immediately.
• Use Cases: Real-time analytics, fraud detection, IoT sensor data monitoring, and alerting.
• Tools: Apache Kafka, Apache Flink, Amazon Kinesis, Apache NiFi, Google Pub/Sub.
3) Lambda Architecture (Hybrid):
• Combines batch and real-time processing to get the benefits of both techniques. The real-time layer provides immediate data processing, while the batch layer ensures data accuracy and completeness by processing larger volumes periodically.
• Use Cases: When both real-time insights and historical data analysis are required (e.g., in recommendation systems, social media analytics, and fraud detection).
• Tools: Hadoop (batch processing) + Apache Kafka (stream processing).
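To make the hybrid idea concrete, here is a toy sketch of a Lambda-style event counter in plain Python. The class name `LambdaCounter` and its methods are hypothetical, and in-memory structures stand in for systems such as Hadoop and Kafka. The batch view is recomputed from the full event log, the real-time view covers only events that arrived since the last batch run, and a query merges the two.

```python
from collections import Counter

class LambdaCounter:
    """Toy Lambda architecture that counts events per key (illustrative only)."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # accurate view, rebuilt periodically
        self.speed_view = Counter()  # incremental view of recent events

    def ingest(self, key):
        self.master_log.append(key)  # input for the next batch run
        self.speed_view[key] += 1    # real-time layer updates immediately

    def run_batch(self):
        # Recompute the batch view from the full log, then reset the
        # real-time view because the batch view now covers those events.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # Merge both views to answer with complete, up-to-date counts.
        return self.batch_view[key] + self.speed_view[key]

lc = LambdaCounter()
for event in ["login", "login", "purchase"]:
    lc.ingest(event)
lc.run_batch()             # periodic batch job
lc.ingest("login")         # arrives after the batch run
print(lc.query("login"))   # 3
```

In this sketch the batch run is triggered manually; a real deployment would schedule it and would need to handle events that arrive while the batch job is running.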

Best Practices for Data Ingestion
Understand the Data Sources:
• Identify all data sources, their structure (structured, semi-structured, or unstructured), and their frequency of updates.
• Ensure the ability to handle various types of data sources (e.g., relational databases, IoT devices, logs, APIs).
Data Schema Management:
• Ensure schema consistency across datasets. When ingesting data, it is important to account for evolving schemas (e.g., adding new fields) without breaking the system.
• For real-time data, use a schema registry together with a schema-based serialization format such as Apache Avro to enforce structure during ingestion.
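The idea of an evolving schema can be sketched without any external library. In the hypothetical example below, `SCHEMA_V2` adds an optional `country` field with a default, so records written before the field existed still pass. This is a plain-Python illustration of the principle, not the Avro or schema-registry API.

```python
# Hypothetical schema: field name -> type, required flag, optional default.
SCHEMA_V2 = {
    "user_id": {"type": int, "required": True},
    "event": {"type": str, "required": True},
    # Added in a later schema version; the default keeps old records valid.
    "country": {"type": str, "required": False, "default": "unknown"},
}

def conform(record, schema):
    """Return a copy of `record` that matches `schema`, or raise an error."""
    out = {}
    for name, rule in schema.items():
        if name in record:
            if not isinstance(record[name], rule["type"]):
                raise TypeError(f"{name}: expected {rule['type'].__name__}")
            out[name] = record[name]
        elif rule["required"]:
            raise KeyError(f"missing required field: {name}")
        else:
            out[name] = rule["default"]  # fill the newly added field
    return out

old_record = {"user_id": 1, "event": "click"}  # written before `country` existed
print(conform(old_record, SCHEMA_V2))
```

Adding only optional fields with defaults is one common way to evolve a schema without breaking existing producers or consumers.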
Data Validation and Cleansing:
• Apply checks to ensure the quality and validity of incoming data. Common issues such as missing values, duplicates, or incorrect data formats should be addressed during ingestion.
• Tools like Apache NiFi and Talend can help automate validation and transformation during ingestion.
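The three common issues named above (missing values, duplicates, incorrect formats) can be checked with a short standard-library routine. The field names (`sensor_id`, `timestamp`, `reading`) and the timestamp format are assumptions made for illustration; tools such as NiFi or Talend express similar rules through their own configuration.

```python
from datetime import datetime

def cleanse(rows):
    """Split rows into (clean, rejected); rejected rows carry a reason."""
    clean, rejected, seen = [], [], set()
    for row in rows:
        # Check 1: missing values.
        if row.get("sensor_id") in (None, "") or row.get("reading") in (None, ""):
            rejected.append((row, "missing value"))
            continue
        # Check 2: incorrect formats (timestamp layout, numeric reading).
        try:
            ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
            reading = float(row["reading"])
        except (KeyError, ValueError, TypeError):
            rejected.append((row, "bad format"))
            continue
        # Check 3: duplicates, keyed on sensor and timestamp.
        key = (row["sensor_id"], ts)
        if key in seen:
            rejected.append((row, "duplicate"))
            continue
        seen.add(key)
        clean.append({"sensor_id": row["sensor_id"], "timestamp": ts, "reading": reading})
    return clean, rejected

rows = [
    {"sensor_id": "a", "timestamp": "2024-12-01 10:00:00", "reading": "21.5"},
    {"sensor_id": "a", "timestamp": "2024-12-01 10:00:00", "reading": "21.5"},
    {"sensor_id": "b", "timestamp": "01/12/2024", "reading": "19"},
    {"sensor_id": "c", "timestamp": "2024-12-01 10:05:00", "reading": None},
]
clean, rejected = cleanse(rows)
print(len(clean), [reason for _, reason in rejected])
```

Keeping rejected rows together with a reason, rather than silently dropping them, makes data quality problems visible to whoever maintains the pipeline.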
Scalability:
• Use scalable solutions that can handle data volume growth over time, especially with increasing data sources and higher data velocity.
• Consider using cloud-based storage solutions (e.g., Amazon S3, Google Cloud Storage) for dynamic scaling capabilities.
Incremental Ingestion:
• Rather than ingesting entire datasets repeatedly, use techniques that ingest only the new or updated data (delta loads). This is particularly useful for batch ingestion and significantly reduces resource usage.
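A delta load is often implemented with a "watermark": the pipeline remembers the latest modification time it has seen and, on the next run, copies only rows changed after it. The sketch below assumes each source row carries a hypothetical `updated_at` field in ISO 8601 form, which compares correctly as a plain string.

```python
def incremental_load(source_rows, target, watermark):
    """Upsert rows changed after `watermark` into `target` (a dict keyed by id).

    Returns the new watermark and the number of rows loaded.
    """
    new_watermark = watermark
    loaded = 0
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = row  # insert new row or overwrite old version
            loaded += 1
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark, loaded

source = [
    {"id": 1, "updated_at": "2024-12-01", "status": "new"},
    {"id": 2, "updated_at": "2024-12-03", "status": "new"},
]
target = {}

# First run: an empty watermark means everything is loaded.
wm, n_first = incremental_load(source, target, "")

# The source changes: row 1 is updated; row 2 is untouched.
source[0] = {"id": 1, "updated_at": "2024-12-05", "status": "shipped"}

# Second run: only the changed row is loaded.
wm, n_second = incremental_load(source, target, wm)
print(n_first, n_second, wm)  # 2 1 2024-12-05
```

A watermark of this kind does not detect rows deleted at the source; that case needs a separate mechanism such as change data capture.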

Metadata Management:
• Maintain clear metadata around the ingestion process, such as data source details, ingestion timestamps, and transformations applied. This makes the ingestion pipeline more transparent and easier to troubleshoot.
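One lightweight way to capture such metadata is a small record written alongside every ingestion run. The `IngestionRecord` class and its field names below are hypothetical; they simply mirror the three items listed above (source details, timestamp, transformations applied).

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class IngestionRecord:
    """Metadata describing one ingestion run (illustrative structure)."""
    source: str                                           # data source details
    row_count: int                                        # rows ingested in this run
    transformations: list = field(default_factory=list)  # steps applied, in order
    ingested_at: str = field(                             # UTC ingestion timestamp
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = IngestionRecord(
    source="orders_db.orders",  # hypothetical source name
    row_count=1200,
    transformations=["dedupe", "normalize_currency"],
)
print(asdict(record))
```

Converting the record with `asdict` yields a plain dictionary that can be logged or stored as JSON next to the ingested data.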

[Q4] Write a Short Note on Data Storage Management
• Data storage is an essential component of any data architecture, and two of the most common solutions for storing and managing large volumes of data are data lakes and data warehouses.
• Although both serve to store data, they differ significantly in terms of structure, use cases, and functionality.

Data Lake
• A data lake is a centralized repository that allows you to store vast amounts of raw data in its original format, whether structured, semi-structured, or unstructured.
• The idea behind a data lake is to provide a flexible environment for storing data without requiring upfront structuring or processing.

Fig 3 : Data Lake
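The "store first, structure later" idea can be sketched with ordinary files. In the example below a temporary directory stands in for the lake, and the folder layout (`raw/<source>/<date>/`) and function names are assumptions for illustration. Payloads are written unchanged in their original format, and structure is applied only when the data is read, an approach commonly called schema-on-read.

```python
import json
import tempfile
from pathlib import Path

def land_raw(lake_root, source, date, filename, payload):
    """Write `payload` (bytes) unchanged under <root>/raw/<source>/<date>/."""
    target_dir = Path(lake_root) / "raw" / source / date
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / filename
    path.write_bytes(payload)  # no parsing or restructuring at write time
    return path

def read_json_events(lake_root, source):
    """Apply structure at read time by parsing the JSON files of one source."""
    source_dir = Path(lake_root) / "raw" / source
    return [json.loads(p.read_text()) for p in sorted(source_dir.rglob("*.json"))]

with tempfile.TemporaryDirectory() as lake:
    # Semi-structured and structured files sit side by side in the lake.
    land_raw(lake, "clickstream", "2024-12-09", "e1.json", b'{"user": 1, "page": "/home"}')
    land_raw(lake, "sensors", "2024-12-09", "dump.csv", b"id,temp\n7,21.5\n")
    events = read_json_events(lake, "clickstream")

print(events)  # [{'user': 1, 'page': '/home'}]
```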

Data Warehouse
OLAP and OLAP cubes
• OLAP, or Online Analytical Processing, refers to a computing approach that allows users to analyze multidimensional data.
• It is contrasted with OLTP, or Online Transactional Processing, a simpler method of interacting with databases that is not designed for analyzing massive amounts of data from different perspectives.
• Traditional databases resemble spreadsheets, using a two-dimensional structure of rows and columns.
• In OLAP, however, datasets are presented in multidimensional structures called OLAP cubes.
• Such structures enable efficient processing and advanced analysis of vast amounts of varied data.
• For example, a sales department report would include dimensions such as product, region, sales representative, and month, with the sales amount as the value being analyzed.
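The sales example can be turned into a small worked sketch. The fact rows and dimension names below are invented for illustration, and plain dictionaries stand in for a real OLAP engine. "Rolling up" sums the sales amount over the dimensions that are left out, and "slicing" fixes one dimension to a single value.

```python
from collections import defaultdict

DIMENSIONS = ("product", "region", "month")

# Hypothetical fact rows: one value per dimension, then the sales amount.
SALES = [
    ("laptop", "west", "2024-01", 1000.0),
    ("laptop", "east", "2024-01", 1500.0),
    ("phone", "west", "2024-01", 700.0),
    ("phone", "west", "2024-02", 300.0),
]

def roll_up(facts, by):
    """Sum the sales amount, grouped by the dimensions listed in `by`."""
    idx = [DIMENSIONS.index(d) for d in by]
    cube = defaultdict(float)
    for row in facts:
        cube[tuple(row[i] for i in idx)] += row[-1]
    return dict(cube)

def slice_cube(facts, dimension, value):
    """Keep only the facts where `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == value]

print(roll_up(SALES, ["region"]))
# {('west',): 2000.0, ('east',): 1500.0}
print(roll_up(slice_cube(SALES, "region", "west"), ["product"]))
# {('laptop',): 1000.0, ('phone',): 1000.0}
```

The same facts answer different questions depending on which dimensions are kept, which is the "different perspectives" property that distinguishes OLAP from row-oriented transactional queries.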

Fig 4 : Data Warehouse

