module5DataEngineering

The document provides an overview of data engineering, covering key concepts such as data ingestion techniques, data storage management, and the roles of data engineers, data scientists, and data analysts. It discusses various data ingestion methods, including batch and real-time processing, and highlights best practices for managing data quality and scalability. Additionally, it contrasts data lakes and data warehouses, emphasizing their distinct functionalities and use cases.

Uploaded by

Andita Dwiyoga

See discussions, stats, and author profiles for this publication at:
https://www.researchgate.net/publication/386573107

module 5 Data Engineering

Chapter · December 2024


1 author:

Khwaish Shahani
Vivekanand Education Society's Institute of Technology

All content following this page was uploaded by Khwaish Shahani on 09 December 2024.



Module 5
Data Engineering

Introduction to Data Engineering, Data Ingestion: Techniques and Best Practices,
Data Storage and Management: Data Lakes, Data Warehouses, Data Processing Pipelines.
5.2 Lambda Architecture, Batch Processing, Stream Processing, Data Quality and
Governance.

Khwaish

[Q1] Write a Short Note on Data Engineering
[1] Data engineering is a field of technology focused on designing, building, and managing
the infrastructure required to collect, store, and analyze large volumes of data.
[2] It enables organizations to transform raw data into useful insights and is the backbone
of data science, machine learning, and business intelligence.
[3] Data engineering is a set of operations that makes data available and usable to data
scientists, data analysts, business intelligence (BI) developers, and other specialists
within an organization.
[4] It takes dedicated experts, data engineers, to design and build systems for gathering
and storing data at scale as well as preparing it for further analysis.

Data Engineer
 [1] A data engineer is a professional who prepares and manages big data that is then
analyzed by data analysts and scientists.
 [2] They are responsible for designing, building, integrating, and maintaining data from
several sources, and thus for designing the infrastructure of the data that is collected in
the database.

Data Scientist
 A data scientist uses machine learning, deep learning, and inferential modeling
techniques to find correlations in data and to create predictive models, on the basis of
which recommendation systems useful to the business can be developed.

Data Analyst
 A data analyst is responsible for:
 screening and cleaning/polishing of the raw data collected;
 data preparation;
 understanding of business metrics and problems;
 visualization of data through reports and graphs;
 identification of trends and useful suggestions to aid in strategic business decisions.

[5] Data Scientist vs Data Engineer vs Data Analyst

 Fig. 1: Data Scientist vs Data Engineer vs Data Analyst

Roles and Responsibilities

| Data Analyst | Data Engineer | Data Scientist |
|---|---|---|
| Preprocessing and data gathering | Develop, test, and maintain architectures | Responsible for developing operating models |
| Emphasis on representing data via reporting and visualization | Understanding programming and its complexity | Carry out data analytics and optimization using deep learning and machine learning |
| Responsible for statistical analysis and data interpretation | Deploy ML and statistical models | Involved in strategic planning for data analytics |
| Ensures data acquisition and maintenance | Building pipelines for various ETL operations | Integrate data and perform ad hoc analysis |
| Optimize statistical efficiency and quality | Ensures data accuracy and flexibility | Fill in the gap between customers and stakeholders |

[Q2] Write a Short Note on Data Engineering Process

 The data engineering process covers a sequence of tasks that turn a large amount of raw
data into a practical product meeting the needs of analysts, data scientists, machine
learning engineers, and others.

 Data ingestion (acquisition) moves data from multiple sources (SQL and NoSQL databases,
IoT devices, websites, streaming services, etc.) to a target system to be transformed for
further analysis.
 Data comes in various forms and can be both structured and unstructured.
 Data transformation adjusts disparate data to the needs of end users. It involves
removing errors and duplicates from data, normalizing it, and converting it into the
needed format.
 Data serving delivers transformed data to end users — a BI platform, dashboard,
or data science team.
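
The three stages above can be sketched in a few lines of Python. This is a minimal illustration only: the two in-memory sources, the field names, and the function names `ingest`, `transform`, and `serve` are assumptions made for the example, not part of any specific tool.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two sources (a CSV export and a JSON API).
CSV_SOURCE = "id,name,amount\n1, Alice ,10.5\n2,Bob,7\n2,Bob,7\n"
JSON_SOURCE = '[{"id": 3, "name": "carol", "amount": "4.25"}]'


def ingest():
    """Ingestion: pull records from every source into one list of dicts."""
    records = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    records.extend(json.loads(JSON_SOURCE))
    return records


def transform(records):
    """Transformation: normalize types and formats, then drop duplicate ids."""
    seen, clean = set(), []
    for rec in records:
        row = {
            "id": int(rec["id"]),
            "name": str(rec["name"]).strip().title(),
            "amount": float(rec["amount"]),
        }
        if row["id"] not in seen:
            seen.add(row["id"])
            clean.append(row)
    return clean


def serve(rows):
    """Serving: expose a summary that a dashboard or analyst could consume."""
    return {"row_count": len(rows), "total_amount": sum(r["amount"] for r in rows)}


report = serve(transform(ingest()))
print(report)
```

In a real pipeline each stage would usually be a separate job or service; chaining plain functions just makes the order of the stages visible.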

Fig 2 : Data Engineering Process

[Q3] Write Short Note On Data Ingestion Techniques

1) Batch Data Ingestion:

 Involves collecting large amounts of raw data from various sources into one place and
then processing it later.
 Data is collected and processed in intervals (e.g., hourly, daily).
 Use Cases: Suitable for historical data processing, reporting, and use cases where real-
time analysis is not required.
 Tools: Apache Sqoop, AWS Glue, Google Dataflow, Talend.
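
As a rough sketch of the batch idea (not tied to any of the tools listed), the example below first accumulates events and only afterwards processes them one hourly batch at a time. The event fields and the helper names `hourly_batches` and `process_batch` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events accumulated from several sources before processing.
raw_events = [
    {"source": "web", "ts": "2024-12-09T10:05:00", "value": 3},
    {"source": "iot", "ts": "2024-12-09T10:40:00", "value": 5},
    {"source": "web", "ts": "2024-12-09T11:15:00", "value": 2},
]


def hourly_batches(events):
    """Group collected events into one batch per hour."""
    batches = defaultdict(list)
    for event in events:
        hour = datetime.fromisoformat(event["ts"]).replace(minute=0, second=0)
        batches[hour].append(event)
    return dict(batches)


def process_batch(batch):
    """Process a whole batch at once, here by summing the values."""
    return sum(event["value"] for event in batch)


hourly_totals = {
    hour.isoformat(): process_batch(batch)
    for hour, batch in sorted(hourly_batches(raw_events).items())
}
print(hourly_totals)
```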
2) Real-Time (Stream) Data Ingestion:
 Involves streaming data into a data warehouse in real time, often using cloud-based
systems that can ingest the data quickly, store it in the cloud, and then release it to users
almost immediately.
 Use Cases: Real-time analytics, fraud detection, IoT sensor data monitoring, and
alerting.
 Tools: Apache Kafka, Apache Flink, Amazon Kinesis, Apache NiFi, Google Pub/Sub.
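
The contrast with batch ingestion can be illustrated with a generator that handles each event the moment it arrives. This is a simplified sketch: `transaction_stream` is a hypothetical stand-in for a consumer reading from a message broker, and the alert threshold is an assumed rule chosen for the example.

```python
from typing import Dict, Iterable, Iterator

# Assumed alert rule for illustration: flag any transaction above this amount.
ALERT_THRESHOLD = 1000.0


def transaction_stream() -> Iterator[Dict]:
    """Hypothetical stand-in for a consumer reading from a message broker."""
    yield {"account": "A1", "amount": 120.0}
    yield {"account": "B7", "amount": 4500.0}
    yield {"account": "A1", "amount": 80.0}


def detect_alerts(events: Iterable[Dict]) -> Iterator[Dict]:
    """Handle each event as it arrives and emit alerts immediately."""
    for event in events:
        if event["amount"] > ALERT_THRESHOLD:
            yield {"alert": "large_transaction", **event}


alerts = list(detect_alerts(transaction_stream()))
print(alerts)
```

Because `detect_alerts` is itself a generator, an alert is produced as soon as the offending event is read, without waiting for the rest of the stream.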
3) Lambda Architecture (Hybrid):
 Combines batch and real-time processing to get the benefits of both techniques. The
real-time layer provides immediate data processing, while the batch layer ensures data
accuracy and completeness by processing larger volumes periodically.

 Use Cases: When both real-time insights and historical data analysis are required (e.g.,
in recommendation systems, social media analytics, and fraud detection).
 Tools: Hadoop (batch processing) + Apache Kafka (stream processing).
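
A toy version of this hybrid approach is sketched below, assuming page-view counting as the workload. The batch view is recomputed from the full history, the real-time (speed) view covers only events that arrived since the last batch run, and a query merges the two. The merge step is commonly called the serving layer, a name not used in the text above; all function and variable names here are illustrative.

```python
from collections import Counter

# Hypothetical page-view events; the first three have already been batch processed.
master_dataset = ["home", "cart", "home"]
recent_events = ["home", "checkout"]


def batch_view(events):
    """Batch layer: recompute complete counts from the full historical dataset."""
    return Counter(events)


def speed_view(events):
    """Speed layer: count only the events that arrived since the last batch run."""
    return Counter(events)


def query(page, batch, speed):
    """Serving step: merge both views to answer a query."""
    return batch[page] + speed[page]


batch = batch_view(master_dataset)
speed = speed_view(recent_events)
print(query("home", batch, speed))
```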

Best Practices for Data Ingestion
Understand the Data Sources:

 Identify all data sources, their structure (structured, semi-structured, or unstructured),
and their frequency of updates.
 Ensure the ability to handle various types of data (e.g., relational databases, IoT
devices, logs, APIs).
Data Schema Management:

 Ensure schema consistency across datasets. When ingesting data, it is important to
account for evolving schemas (e.g., adding new fields) without breaking the system.
 For real-time data, use a schema registry together with a schema format such as Apache
Avro to enforce structure during ingestion.
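
One way to tolerate an evolving schema is to give every newly added field a default, so that records written before the change still pass. The sketch below shows the idea in plain Python; the schema layout and the `conform` helper are assumptions for illustration and do not reproduce the API of any schema registry.

```python
# Hypothetical schema: field name -> (type, required flag, default for optional fields).
SCHEMA_V2 = {
    "id": (int, True, None),
    "name": (str, True, None),
    "country": (str, False, "unknown"),  # field added in a later schema version
}


def conform(record, schema):
    """Validate a record against the schema and fill defaults for optional fields."""
    out = {}
    for field, (ftype, required, default) in schema.items():
        if field in record:
            if not isinstance(record[field], ftype):
                raise TypeError(f"{field} must be {ftype.__name__}")
            out[field] = record[field]
        elif required:
            raise KeyError(f"missing required field: {field}")
        else:
            out[field] = default
    return out


old_record = {"id": 1, "name": "Alice"}  # produced before "country" existed
new_record = {"id": 2, "name": "Bob", "country": "IN"}
print(conform(old_record, SCHEMA_V2))
print(conform(new_record, SCHEMA_V2))
```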
Data Validation and Cleansing:

 Apply checks to ensure the quality and validity of incoming data. Common issues such
as missing values, duplicates, or incorrect data formats should be addressed during
ingestion.
 Tools like Apache NiFi and Talend can help automate validation and transformation
during ingestion.
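
The three issues named above (missing values, duplicates, incorrect formats) can be checked with a few lines of hand-written code. The following is a minimal sketch with made-up rows and an assumed date format of YYYY-MM-DD; rejected rows are kept along with the reason so they can be inspected later.

```python
from datetime import datetime

# Hypothetical incoming rows with a missing value, a duplicate, and a bad date format.
incoming = [
    {"id": "1", "email": "a@example.com", "signup": "2024-12-01"},
    {"id": "2", "email": "", "signup": "2024-12-02"},
    {"id": "1", "email": "a@example.com", "signup": "2024-12-01"},
    {"id": "3", "email": "c@example.com", "signup": "02/12/2024"},
]


def validate(rows):
    """Split rows into accepted and rejected, recording the reason for each rejection."""
    accepted, rejected, seen_ids = [], [], set()
    for row in rows:
        if not row["email"]:
            rejected.append((row, "missing email"))
            continue
        if row["id"] in seen_ids:
            rejected.append((row, "duplicate id"))
            continue
        try:
            datetime.strptime(row["signup"], "%Y-%m-%d")
        except ValueError:
            rejected.append((row, "bad date format"))
            continue
        seen_ids.add(row["id"])
        accepted.append(row)
    return accepted, rejected


accepted, rejected = validate(incoming)
print(len(accepted), [reason for _, reason in rejected])
```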
Scalability:

 Use scalable solutions that can handle data volume growth over time, especially with
increasing data sources and higher data velocity.
 Consider using cloud-based storage solutions (e.g., Amazon S3, Google Cloud Storage)
for dynamic scaling capabilities.
Incremental Ingestion:

 Rather than ingesting entire datasets repeatedly, use techniques that ingest only the
new or updated data (delta loads). This is particularly useful for batch ingestion and
significantly reduces resource usage.
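
A common way to implement a delta load is to remember the latest change timestamp seen so far (often called a watermark) and fetch only rows newer than it. The sketch below assumes the source exposes an `updated_at` column; the column name and the `delta_load` helper are hypothetical.

```python
# Hypothetical source table; "updated_at" is an ISO 8601 timestamp string,
# so plain string comparison matches chronological order.
source_rows = [
    {"id": 1, "updated_at": "2024-12-01T08:00:00"},
    {"id": 2, "updated_at": "2024-12-05T09:30:00"},
    {"id": 3, "updated_at": "2024-12-08T17:45:00"},
]


def delta_load(rows, last_watermark):
    """Return rows changed after the watermark, plus the new watermark to store."""
    delta = [row for row in rows if row["updated_at"] > last_watermark]
    new_watermark = max((row["updated_at"] for row in delta), default=last_watermark)
    return delta, new_watermark


delta, watermark = delta_load(source_rows, "2024-12-04T00:00:00")
print([row["id"] for row in delta], watermark)
```

Running `delta_load` again with the returned watermark yields an empty list until the source changes, which is where the resource savings come from.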

Metadata Management:
 Maintain clear metadata around the ingestion process, such as data source details,
ingestion timestamps, and transformations applied. This makes the ingestion pipeline
more transparent and easier to troubleshoot.
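
As an illustration, the metadata for one ingestion run could be captured in a small record like the one below. The class name, field names, and sample values are assumptions made for this sketch; real pipelines typically store such records in a catalog or log table.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List


@dataclass
class IngestionMetadata:
    """Hypothetical record describing one ingestion run."""
    source: str
    row_count: int
    transformations: List[str] = field(default_factory=list)
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


run = IngestionMetadata(
    source="orders_db.public.orders",  # illustrative source name
    row_count=1250,
    transformations=["drop_duplicates", "normalize_currency"],
)
print(asdict(run))
```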

[Q4] Write a Short Note on Data Storage Management
 Data storage is an essential component of any data architecture, and two of the most
common solutions for storing and managing large volumes of data are data lakes and
data warehouses.
 Although both serve to store data, they differ significantly in terms of structure, use
cases, and functionality.

Data Lake
 A data lake is a centralized repository that allows you to store vast amounts
of raw data in its original format, whether structured, semi-structured, or
unstructured.
 The idea behind a data lake is to provide a flexible environment for storing
data without requiring upfront structuring or processing.
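
The idea of storing data as-is and structuring it only when it is used (often called schema on read, a term not used in the text above) can be sketched with a temporary directory standing in for the lake. The paths, payloads, and the `read` helper are hypothetical.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

# Hypothetical payloads in their original formats.
payloads = {
    "sales/2024-12-09/orders.csv": "id,total\n1,19.99\n",
    "clicks/2024-12-09/events.json": '[{"page": "home"}]',
    "support/2024-12-09/ticket.txt": "Customer cannot log in.",
}

lake = Path(tempfile.mkdtemp())  # stands in for object storage such as a bucket

# Write: store every payload unchanged, with no upfront schema.
for key, content in payloads.items():
    path = lake / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")


def read(key):
    """Read: apply structure only when the data is used."""
    text = (lake / key).read_text(encoding="utf-8")
    if key.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(text)))
    if key.endswith(".json"):
        return json.loads(text)
    return text  # unstructured data is returned as-is


print(read("sales/2024-12-09/orders.csv"))
```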

Fig 3 : Data Lake

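Data Warehouse
 A data warehouse is a central repository of structured, processed data organized for
querying and analysis.
OLAP and OLAP cubes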
 OLAP or Online Analytical Processing refers to the computing approach allowing users
to analyze multidimensional data.

 It is contrasted with OLTP, or Online Transaction Processing, a simpler method of
interacting with databases that is not designed for analyzing massive amounts of data from
different perspectives.

 Traditional databases resemble spreadsheets, using the two-dimensional structure of rows
and columns.

 However, in OLAP, datasets are presented in multidimensional structures called OLAP
cubes.

 Such structures enable efficient processing and advanced analysis of vast amounts of
varied data.

 For example, a sales department report would include such dimensions as product,
region, sales representative, and month, with the sales amount as the measured value.
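
To make the cube idea concrete, the sketch below treats product, region, and month as dimensions and the sales amount as the value being aggregated, then views the same facts from two perspectives. The data and the helper names `roll_up` and `slice_cube` are made up for illustration; a real OLAP engine precomputes and indexes such aggregates rather than scanning a list.

```python
from collections import defaultdict

# Hypothetical sales facts: three dimensions plus one measure (amount).
sales = [
    {"product": "laptop", "region": "north", "month": "2024-11", "amount": 1200},
    {"product": "laptop", "region": "south", "month": "2024-11", "amount": 900},
    {"product": "phone", "region": "north", "month": "2024-12", "amount": 600},
    {"product": "laptop", "region": "north", "month": "2024-12", "amount": 1500},
]


def roll_up(facts, dimensions):
    """Aggregate the measure over the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact["amount"]
    return dict(totals)


def slice_cube(facts, dimension, value):
    """Keep only the facts where one dimension has a fixed value."""
    return [fact for fact in facts if fact[dimension] == value]


by_product = roll_up(sales, ["product"])
north_by_month = roll_up(slice_cube(sales, "region", "north"), ["month"])
print(by_product)
print(north_by_month)
```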

 Fig 4 : Data Warehouse

