CH 05 Data Engineering

Module 5 covers data engineering, focusing on the design, building, and management of infrastructure for data collection, storage, and analysis. It distinguishes between data engineers, data scientists, and data analysts, detailing their roles and responsibilities. The module also discusses data ingestion techniques, storage solutions like data lakes and warehouses, and best practices for managing data quality and governance.


MODULE 5

Data Engineering

Compiled by Dr. Rohini Temkar


Module 5 - Contents

5.1 • Introduction to Data Engineering, Data Ingestion: Techniques and Best Practices, Data Storage and Management: Data Lakes, Data Warehouses, Data Processing Pipelines.

5.2 • Lambda Architecture, Batch Processing, Stream Processing, Data Quality and Governance


Data engineering
• Data engineering is a discipline focused on designing, building, and managing the infrastructure required to collect, store, and analyze large volumes of data.

• It enables organizations to transform raw data into useful insights and is the backbone of data science, machine learning, and business intelligence.


Data engineering
• Data engineering is a set of operations to make data available and usable to data scientists, data analysts, business intelligence (BI) developers, and other specialists within an organization.

• It takes dedicated experts – data engineers – to design and build systems for gathering and storing data at scale, as well as preparing it for further analysis.


Data Scientist Vs Data Engineer Vs Data Analyst

Data Engineer
■ A data engineer is a professional who prepares and manages big data that is then analyzed by data analysts and scientists.

■ They are responsible for designing, building, integrating, and maintaining data from several sources, thus designing the infrastructure of the data that is collected in the database.
Data Scientist
■ Data Scientists make use of Machine Learning, Deep Learning techniques, and Inferential Modeling to find correlations between data and create predictive models, on the basis of which they can develop recommendation systems useful for the business.
Data Analyst
■ Data Analyst is responsible for:

■ screening and cleaning/polishing of the raw data collected;
■ data preparation;
■ understanding of business metrics and problems;
■ visualization of data through reports and graphs;
■ identification of trends and useful suggestions to aid in strategic business decisions.
Roles and Responsibilities

Data Engineering Process

Data Engineering Process
■ The data engineering process covers a sequence of tasks that turn a large amount of raw data into a practical product meeting the needs of analysts, data scientists, machine learning engineers, and others.
Data Engineering Process
Data ingestion (acquisition) moves data from multiple sources — SQL and NoSQL databases, IoT devices, websites, streaming services, etc. — to a target system to be transformed for further analysis.

Data comes in various forms and can be both structured and unstructured.
Data Engineering Process
■ Data transformation adjusts disparate data to the needs of end users. It involves removing errors and duplicates from data, normalizing it, and converting it into the needed format.

■ Data serving delivers transformed data to end users — a BI platform, dashboard, or data science team.
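A minimal sketch of the transformation and serving steps using pandas; the file names and columns (order_id, amount, country) are assumptions for illustration only.

```python
# Transform a raw extract and serve the curated result for BI consumption.
# Assumes pandas and pyarrow are installed; names and paths are illustrative.
import pandas as pd

raw = pd.read_csv("raw_orders.csv")                      # data as ingested

clean = (
    raw.drop_duplicates(subset="order_id")               # remove duplicate records
       .dropna(subset=["order_id", "amount"])            # drop rows missing key fields
       .assign(
           country=lambda df: df["country"].str.strip().str.upper(),   # normalize text
           amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),
       )
       .dropna(subset=["amount"])                        # drop rows that failed conversion
)

# "Serving": write the curated data where a BI tool or dashboard can read it.
clean.to_parquet("curated_orders.parquet", index=False)
```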
Data Ingestion Techniques
1) Batch Data Ingestion:

● Involves collecting large amounts of raw data from various sources into one place and then processing it later.

● Data is collected and processed in intervals (e.g., hourly, daily).

● Use Cases: Suitable for historical data processing, reporting, and use cases where real-time analysis is not required.

● Tools: Apache Sqoop, AWS Glue, Google Dataflow, Talend.
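A minimal batch-ingestion sketch; the landing/archive directories, file-naming pattern, and daily schedule are assumptions for illustration, not any specific tool's behaviour.

```python
# Collect the previous day's raw files from a landing directory into one batch
# folder for later processing. A scheduler (cron, Airflow, etc.) would run this
# once per interval.
import glob
import shutil
from datetime import date, timedelta
from pathlib import Path

def run_daily_batch(landing_dir: str, archive_dir: str) -> list[str]:
    """Gather yesterday's files in one place; a loader would then process them."""
    day = (date.today() - timedelta(days=1)).isoformat()
    batch_dir = Path(archive_dir) / day
    batch_dir.mkdir(parents=True, exist_ok=True)

    collected = []
    for src in glob.glob(f"{landing_dir}/*_{day}.csv"):   # e.g. orders_2024-01-31.csv
        shutil.move(src, str(batch_dir / Path(src).name))
        collected.append(src)
    return collected

# Example invocation (paths are hypothetical):
# run_daily_batch("/data/landing", "/data/batches")
```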
Data Ingestion Techniques
2) Real-Time (Stream) Data Ingestion:

● Involves streaming data into a data warehouse in real time, often using cloud-based systems that can ingest the data quickly, store it in the cloud, and then release it to users almost immediately.

● Use Cases: Real-time analytics, fraud detection, IoT sensor data monitoring, and alerting.

● Tools: Apache Kafka, Apache Flink, Amazon Kinesis, Apache NiFi, Google Pub/Sub.
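A minimal stream-ingestion sketch, assuming the kafka-python client is installed and a Kafka broker is reachable at localhost:9092; the topic name and event fields are made up.

```python
# Push each event as it happens instead of waiting for a batch window.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {"sensor_id": "s-17", "temperature": 21.4, "ts": "2024-01-31T10:15:00Z"}
producer.send("iot-readings", value=event)   # "iot-readings" is an example topic
producer.flush()                             # ensure the event has left the client buffer
```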
Data Ingestion Techniques
3) Lambda Architecture (Hybrid):

● Combines batch and real-time processing to get the benefits of both techniques. The real-time layer provides immediate data processing, while the batch layer ensures data accuracy and completeness by processing larger volumes periodically.

● Use Cases: When both real-time insights and historical data analysis are required (e.g., in recommendation systems, social media analytics, and fraud detection).

● Tools: Hadoop (batch processing) + Apache Kafka (stream processing).
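A conceptual sketch of the serving side of a Lambda architecture: a query merges the accurate, periodically recomputed batch view with the fresh real-time view. The in-memory dictionaries, user IDs, and counts are stand-ins for whatever stores the two layers actually use.

```python
# Batch layer gives completeness; the speed (real-time) layer fills in recent data.
batch_view = {"user_42": 120, "user_7": 35}      # page views up to the last batch run
realtime_view = {"user_42": 3, "user_99": 1}     # page views since the last batch run

def query_page_views(user: str) -> int:
    """Merge the batch view and the real-time view at query time."""
    return batch_view.get(user, 0) + realtime_view.get(user, 0)

print(query_page_views("user_42"))  # 123: historical plus fresh counts
```

The batch view is periodically rebuilt from the full history, at which point the real-time view is reset, which is how the architecture keeps long-term accuracy while still answering with up-to-date numbers.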
Best Practices for Data Ingestion
Understand the Data Sources:
● Identify all data sources, their structure (structured, semi-structured,
or unstructured), and their frequency of updates.
● Ensure the ability to handle various types of data (e.g., relational
databases, IoT devices, logs, APIs).

Data Schema Management:

● Ensure schema consistency across datasets. When ingesting data, it is important to account for evolving schemas (e.g., adding new fields) without breaking the system.

● Use schema registries for real-time data, together with schema-based serialization formats such as Apache Avro, to enforce structure during ingestion.
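A minimal sketch of enforcing structure during ingestion with an Avro schema, assuming the fastavro package; the record type, field names, and values are illustrative.

```python
# Validate incoming records against an Avro schema before accepting them.
from fastavro import parse_schema
from fastavro.validation import validate

schema = parse_schema({
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount",   "type": "double"},
        # A field added later: making it nullable with a default keeps old producers compatible.
        {"name": "channel",  "type": ["null", "string"], "default": None},
    ],
})

record = {"order_id": "o-1001", "amount": 49.9, "channel": None}
validate(record, schema)   # raises if the record violates the declared structure
```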
Best Practices for Data Ingestion
Data Validation and Cleansing:
● Apply checks to ensure the quality and validity of incoming data.
Common issues such as missing values, duplicates, or incorrect data
formats should be addressed during ingestion.
● Tools like Apache NiFi and Talend can help automate validation and transformation during ingestion (see the validation sketch below).

Scalability:

● Use scalable solutions that can handle data volume growth over time,
especially with increasing data sources and higher data velocity.
● Consider using cloud-based storage solutions (e.g., Amazon S3,
Google Cloud Storage) for dynamic scaling capabilities.

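For the Data Validation and Cleansing practice above, a minimal pandas sketch; the column names and rules are assumptions for illustration, not a specific tool's API.

```python
# Reject an incoming batch if it has missing keys, duplicates, or malformed amounts.
import pandas as pd

def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    issues = []
    if df["order_id"].isna().any():
        issues.append("missing order_id values")
    if df.duplicated(subset="order_id").any():
        issues.append("duplicate order_id values")
    amounts = pd.to_numeric(df["amount"], errors="coerce")   # format check
    if amounts.isna().any() or (amounts < 0).any():
        issues.append("malformed or negative amounts")
    if issues:
        raise ValueError("rejected batch: " + "; ".join(issues))
    return df.assign(amount=amounts)
```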
Best Practices for Data Ingestion

Data Deduplication:
● Duplicate data can distort analytics and increase storage costs. Ensure that your ingestion system includes mechanisms to identify and remove duplicate records (see the deduplication sketch below).

Optimize Throughput and Latency:

● For real-time ingestion, reduce latency by using efficient, low-latency transport layers like Apache Kafka or AWS Kinesis.

● In batch ingestion, ensure throughput is maximized by tuning data transfer rates and scheduling ingestion during off-peak hours to optimize system performance.
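For the Data Deduplication practice above, a minimal in-memory sketch; the record shape is an assumption, and a production system would bound this state (e.g., with a TTL, a Bloom filter, or a keyed store).

```python
# Drop records whose ID has already been seen during ingestion.
seen_ids: set[str] = set()

def ingest(record: dict) -> bool:
    """Return True if the record was accepted, False if it was a duplicate."""
    record_id = record["order_id"]
    if record_id in seen_ids:
        return False          # duplicate: skip to avoid skewed analytics and extra storage
    seen_ids.add(record_id)
    # ... hand the record to the rest of the pipeline here ...
    return True

assert ingest({"order_id": "o-1", "amount": 10.0}) is True
assert ingest({"order_id": "o-1", "amount": 10.0}) is False   # second copy is dropped
```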
Best Practices for Data Ingestion

Data Compression and Serialization:


● To optimize storage and transmission, use efficient serialization formats (e.g., Parquet, Avro, ORC) and compress data where possible. These formats are especially useful for handling large datasets efficiently (see the Parquet sketch below).

Error Handling and Monitoring:

● Implement proper logging, error handling, and retry mechanisms for failed ingestion attempts. Tools like Datadog and Prometheus can help monitor the ingestion process.

● Ensure that you have alerting systems in place in case of data pipeline failures or bottlenecks.
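For the Data Compression and Serialization practice above, a minimal sketch that writes a columnar, compressed file, assuming pandas and pyarrow are installed; the data and file name are illustrative.

```python
# Serialize a small dataset to Parquet with compression.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "order_id": ["o-1", "o-2", "o-3"],
    "amount":   [10.0, 25.5, 7.25],
})

table = pa.Table.from_pandas(df)
# Snappy is a common default; gzip or zstd trade more CPU for smaller files.
pq.write_table(table, "orders.parquet", compression="snappy")
```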
Best Practices for Data Ingestion

Secure Data Transfers:


● Ensure encryption during data transit using HTTPS or other
secure protocols. Additionally, secure access to data sources by
enforcing authentication and access controls.
● For sensitive data, ensure compliance with regulations such as
GDPR or HIPAA by masking or anonymizing sensitive fields.

Data Partitioning and Load Balancing:

● Partition large datasets to improve ingestion speed and scalability. Tools like Apache Kafka allow for partitioned topic structures for distributed ingestion.

● Load-balance the ingestion workloads across multiple nodes or systems to avoid bottlenecks.
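For the Data Partitioning point above, a minimal sketch of key-based partition assignment; it mimics the idea behind partitioned Kafka topics rather than Kafka's actual partitioner, and the key field and partition count are assumptions.

```python
# Records with the same key always land in the same partition, so separate
# workers can ingest separate partitions in parallel without a single bottleneck.
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

partitions: dict[int, list[dict]] = {p: [] for p in range(NUM_PARTITIONS)}
for record in [{"customer": "alice"}, {"customer": "bob"}, {"customer": "alice"}]:
    partitions[partition_for(record["customer"])].append(record)
# Each partition's records can now be handled by a different node or worker.
```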
Best Practices for Data Ingestion

Incremental Ingestion:
● Rather than ingesting entire datasets repeatedly, use
techniques that ingest only the new or updated data (delta
loads). This is particularly useful for batch ingestion and
significantly reduces resource usage.
Metadata Management:
● Maintain clear metadata around the ingestion process, such
as data source details, ingestion timestamps, and
transformations applied. This makes the ingestion pipeline
more transparent and easier to troubleshoot.

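A minimal delta-load sketch covering the incremental-ingestion and metadata points above; the SQLite source, orders table, updated_at column, and state-file name are all assumptions for illustration.

```python
# Pull only rows updated since the last recorded watermark, and keep basic
# ingestion metadata (source, timestamps, row counts) alongside it.
import json
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

STATE_FILE = Path("ingestion_state.json")

def load_new_rows(db_path: str) -> list[tuple]:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    watermark = state.get("last_updated_at", "1970-01-01T00:00:00")

    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT id, amount, updated_at FROM orders "
            "WHERE updated_at > ? ORDER BY updated_at",
            (watermark,),
        ).fetchall()

    if rows:
        state["last_updated_at"] = rows[-1][2]              # advance the watermark
    state["last_run"] = datetime.now(timezone.utc).isoformat()
    state["source"] = db_path                               # metadata: where the data came from
    state["rows_ingested"] = len(rows)
    STATE_FILE.write_text(json.dumps(state, indent=2))      # metadata makes reruns traceable
    return rows
```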
Data Storage Management
■ Data storage is an essential component of any data
architecture, and two of the most common solutions for
storing and managing large volumes of data are data
lakes and data warehouses.

■ Although both serve to store data, they differ significantly in terms of structure, use cases, and functionality.
Data lake
■ A data lake is a centralized repository that allows you to store vast
amounts of raw data in its original format, whether structured,
semi-structured, or unstructured.
■ The idea behind a data lake is to provide a flexible environment for
storing data without requiring upfront structuring or processing.

Data lake
■ A data lake uses the ELT approach and starts data loading immediately after extracting it, handling raw — often unstructured — data (see the loading sketch below).

■ A data lake is worth building in those projects that will scale and
need a more advanced architecture.

■ Besides, it’s very convenient when the purpose of the data hasn’t
been determined yet. In this case, you can load data quickly, store
it, and modify it as necessary.

■ Data lakes are also a powerful tool for data scientists and ML engineers, who use the raw data to prepare it for predictive analytics and machine learning.
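A minimal sketch of the ELT-style load described above: raw events are written to a date-partitioned lake path as-is, with no upfront structuring, and transformed later once their purpose is known. The directory layout and event shape are assumptions.

```python
# Land raw events in the data lake in their original form.
import json
from datetime import date
from pathlib import Path
from uuid import uuid4

def load_raw(event: dict, lake_root: str = "datalake/raw/events") -> Path:
    partition = Path(lake_root) / f"ingest_date={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / f"{uuid4()}.json"
    target.write_text(json.dumps(event))     # stored as-is; structure is applied later
    return target

load_raw({"clickstream": {"page": "/pricing", "user": "u-9"}, "source": "web"})
```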
Data Warehouse
■ A data warehouse is a highly structured, centralized repository
designed for storing processed and structured data, usually to
support reporting, business intelligence (BI), and analytics.
■ It is designed to optimize query performance for large datasets
and supports OLAP (Online Analytical Processing) workloads.

Data Warehouse
OLAP and OLAP cubes
■ OLAP or Online Analytical Processing refers to the computing
approach allowing users to analyze multidimensional data.
■ It’s contrasted with OLTP or Online Transactional Processing, a
simpler method of interacting with databases, not designed for
analyzing massive amounts of data from different
perspectives.
■ Traditional databases resemble spreadsheets, using the
two-dimensional structure of rows and columns.
■ However, in OLAP, datasets are presented in multidimensional
structures -- OLAP cubes.
■ Such structures enable efficient processing and advanced
analysis of vast amounts of varied data.
■ For example, a sales department report would include such
dimensions as product, region, sales representative, sales
amount, month, and so on.
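A minimal OLAP-style sketch with pandas, aggregating a small sales table across the product, region, and month dimensions mentioned above; the data is made up for illustration.

```python
# Build a small "cube": sales amount summed for every product x region combination per month.
import pandas as pd

sales = pd.DataFrame({
    "product": ["laptop", "laptop", "phone", "phone"],
    "region":  ["north",  "south",  "north", "south"],
    "month":   ["Jan",    "Jan",    "Feb",   "Feb"],
    "amount":  [1200,     800,      600,     650],
})

cube = sales.pivot_table(
    index=["product", "region"],
    columns="month",
    values="amount",
    aggfunc="sum",
    fill_value=0,
    margins=True,          # adds roll-up totals, similar to drilling up a dimension
)
print(cube)
```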
