CH 05 Data Engineering
Data Engineer
■ A data engineer is a professional who prepares and
manages big data that is then analyzed by data
analysts and scientists.
Data Scientist
■ A data scientist makes use of machine learning, deep
learning, and inferential modeling techniques to find
correlations in data and build predictive models, which
can then be used to develop recommendation systems
useful to the business.
Data Analyst
■ A data analyst is responsible for:
Roles and Responsibilities
Data Engineering Process
■ The data engineering process covers a
sequence of tasks that turn a large amount
of raw data into a practical product meeting
the needs of analysts, data
scientists, machine learning engineers, and
others.
Data Engineering Process
■ Data ingestion (acquisition) moves data from multiple
sources — SQL and NoSQL databases, IoT devices,
websites, streaming services, etc. — to a target system
to be transformed for further analysis.
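To make this concrete, here is a minimal batch-ingestion sketch in Python that copies one source (a SQLite table) into a CSV landing file for later transformation. The database file, the orders table, and the landing directory are hypothetical; a real pipeline would typically pull from many such sources.

# Minimal ingestion sketch: read rows from a source database and land them
# in a target location, unchanged, for later transformation.
# The database file, table name, and output directory are hypothetical.
import csv
import sqlite3
from pathlib import Path

def ingest_orders(source_db: str = "source.db", target_dir: str = "landing") -> Path:
    """Copy the 'orders' table from a SQLite source into a CSV landing file."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    out_path = Path(target_dir) / "orders.csv"

    with sqlite3.connect(source_db) as conn:
        cursor = conn.execute("SELECT * FROM orders")
        columns = [col[0] for col in cursor.description]
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(columns)   # header row
            writer.writerows(cursor)   # data rows, streamed from the cursor
    return out_path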
Data Engineering Process
■ Data transformation adjusts disparate data to
the needs of end users. It involves removing
errors and duplicates from data, normalizing it,
and converting it into the needed format.
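As a rough illustration of these steps, the sketch below (assuming pandas, a hypothetical orders file with order_id, amount, region, and order_date columns, and pyarrow for Parquet output) removes duplicates and bad rows, normalizes values, and converts the result into the format downstream users need.

# Minimal transformation sketch: remove duplicates and errors, normalize
# values, and convert the cleaned data into the format consumers need.
# Column names and the Parquet target are hypothetical.
from pathlib import Path

import pandas as pd

def transform_orders(raw_csv: str = "landing/orders.csv",
                     out_dir: str = "curated") -> pd.DataFrame:
    df = pd.read_csv(raw_csv)

    df = df.drop_duplicates()                                    # remove duplicate records
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # fix malformed numbers
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df = df.dropna(subset=["order_id", "amount", "order_date"])  # drop rows missing key fields
    df["region"] = df["region"].str.strip().str.lower()          # normalize text values

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    out_path = Path(out_dir) / "orders.parquet"
    df.to_parquet(out_path, index=False)                         # convert to the needed format
    return df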
Data Ingestion Techniques
1) Batch Data Ingestion
2) Real-Time (Stream) Data Ingestion
3) Lambda Architecture (Hybrid)
Best Practices for Data Ingestion
Understand the Data Sources:
● Identify all data sources, their structure (structured, semi-structured,
or unstructured), and their frequency of updates.
● Ensure the ability to handle various types of data (e.g., relational
databases, IoT devices, logs, APIs).
Best Practices for Data Ingestion
Data Validation and Cleansing:
● Apply checks to ensure the quality and validity of incoming data.
Common issues such as missing values, duplicates, or incorrect data
formats should be addressed during ingestion (a minimal sketch
follows below).
● Tools like Apache NiFi and Talend can help automate validation and
transformation during ingestion.
Scalability:
● Use scalable solutions that can handle data volume growth over time,
especially with increasing data sources and higher data velocity.
● Consider using cloud-based storage solutions (e.g., Amazon S3,
Google Cloud Storage) for dynamic scaling capabilities.
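Tying the validation point above to code, here is a minimal sketch of the kind of checks an ingestion step might apply; the field names and rules are hypothetical, and dedicated tools such as Apache NiFi or Talend would normally perform this inside the ingestion flow.

# Minimal validation sketch: check incoming records for missing values and
# malformed fields, and set the bad ones aside for inspection.
# Field names and rules are hypothetical.
from datetime import datetime

def is_valid(record: dict) -> bool:
    """Return True if a record passes basic quality checks."""
    if not record.get("order_id"):                            # missing key field
        return False
    try:
        float(record["amount"])                               # amount must be numeric
        datetime.strptime(record["order_date"], "%Y-%m-%d")   # date must be ISO-formatted
    except (KeyError, ValueError):
        return False
    return True

def split_records(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate valid records from invalid ones (quarantined, not silently dropped)."""
    valid = [r for r in records if is_valid(r)]
    invalid = [r for r in records if not is_valid(r)]
    return valid, invalid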
Best Practices for Data Ingestion
Data Deduplication:
● Duplicate data can distort analytics and increase storage costs.
Ensure that your ingestion system includes mechanisms to
identify and remove duplicate records.
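A minimal sketch of such a mechanism, assuming records carry a business key like the hypothetical order_id: hash the key and skip anything already seen. A production system would usually persist the seen-key set (for example in a database) so deduplication also works across runs.

# Minimal deduplication sketch: hash each record's business key and skip
# records whose key has already been seen during this ingestion run.
# The key field is hypothetical.
import hashlib

def dedupe(records: list[dict], key_fields: tuple[str, ...] = ("order_id",)) -> list[dict]:
    seen: set[str] = set()
    unique: list[dict] = []
    for record in records:
        key = "|".join(str(record.get(f, "")) for f in key_fields)
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        if digest not in seen:                 # first time we see this key
            seen.add(digest)
            unique.append(record)
    return unique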
Best Practices for Data Ingestion
Incremental Ingestion:
● Rather than ingesting entire datasets repeatedly, use
techniques that ingest only the new or updated data (delta
loads). This is particularly useful for batch ingestion and
significantly reduces resource usage (see the sketch below).
Metadata Management:
● Maintain clear metadata around the ingestion process, such
as data source details, ingestion timestamps, and
transformations applied. This makes the ingestion pipeline
more transparent and easier to troubleshoot.
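A minimal sketch combining both points, assuming a hypothetical orders table with an updated_at column: a stored watermark drives the delta load, and a small metadata record is appended for each run.

# Minimal incremental-ingestion sketch with metadata logging.
# The source table, watermark file, and metadata layout are all hypothetical.
import json
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

WATERMARK_FILE = Path("state/orders_watermark.txt")
METADATA_LOG = Path("state/ingestion_log.jsonl")

def incremental_ingest(source_db: str = "source.db") -> list[tuple]:
    WATERMARK_FILE.parent.mkdir(parents=True, exist_ok=True)
    last_seen = (WATERMARK_FILE.read_text().strip()
                 if WATERMARK_FILE.exists() else "1970-01-01T00:00:00")

    # Delta load: only rows changed since the last successful run.
    with sqlite3.connect(source_db) as conn:
        rows = conn.execute(
            "SELECT order_id, amount, updated_at FROM orders "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last_seen,),
        ).fetchall()

    if rows:
        WATERMARK_FILE.write_text(str(rows[-1][2]))   # newest updated_at becomes the new watermark

    # Metadata for transparency and troubleshooting: source, counts, timestamps.
    run_info = {
        "source": source_db,
        "table": "orders",
        "rows_ingested": len(rows),
        "previous_watermark": last_seen,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(METADATA_LOG, "a", encoding="utf-8") as log:
        log.write(json.dumps(run_info) + "\n")
    return rows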
Data Storage Management
■ Data storage is an essential component of any data
architecture, and two of the most common solutions for
storing and managing large volumes of data are data
lakes and data warehouses.
Data lake
■ A data lake is a centralized repository that allows you to store vast
amounts of raw data in its original format, whether structured,
semi-structured, or unstructured.
■ The idea behind a data lake is to provide a flexible environment for
storing data without requiring upfront structuring or processing.
Data lake
■ A data lake uses the ELT approach and starts data loading
immediately after extracting it, handling raw — often unstructured
— data.
■ A data lake is worth building for projects that will scale and
need a more advanced architecture.
■ It is also very convenient when the purpose of the data hasn't
been determined yet: in this case, you can load data quickly, store
it, and modify it as necessary.
■ Data lakes are also a powerful tool for data scientists and ML
engineers, who use the raw data to prepare it for predictive
analytics and machine learning.
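To illustrate the ELT pattern on a small scale, here is a sketch that loads raw files into a purely local, stand-in lake as-is, partitioned by ingestion date; transformation happens later, when the data is read. The paths are hypothetical, and in practice the lake would usually sit on object storage such as Amazon S3.

# Minimal ELT-style loading sketch for a local stand-in data lake:
# raw files go into a date-partitioned "raw" zone unchanged.
import shutil
from datetime import date
from pathlib import Path

LAKE_ROOT = Path("datalake/raw")   # hypothetical lake location

def load_raw(source_file: str, dataset: str) -> Path:
    """Copy a raw file into the lake unchanged (Extract + Load; Transform comes later)."""
    partition = LAKE_ROOT / dataset / f"ingestion_date={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / Path(source_file).name
    shutil.copy2(source_file, target)
    return target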
Data Warehouse
■ A data warehouse is a highly structured, centralized repository
designed for storing processed and structured data, usually to
support reporting, business intelligence (BI), and analytics.
■ It is designed to optimize query performance for large datasets
and supports OLAP (Online Analytical Processing) workloads.
Data Warehouse
OLAP and OLAP cubes
■ OLAP or Online Analytical Processing refers to the computing
approach allowing users to analyze multidimensional data.
■ It’s contrasted with OLTP or Online Transactional Processing, a
simpler method of interacting with databases, not designed for
analyzing massive amounts of data from different
perspectives.
■ Traditional databases resemble spreadsheets, using the
two-dimensional structure of rows and columns.
■ In OLAP, however, datasets are presented in multidimensional
structures called OLAP cubes.
■ Such structures enable efficient processing and advanced
analysis of vast amounts of varied data.
■ For example, a sales department report could be analyzed along
dimensions such as product, region, sales representative, and
month, with the sales amount as the measure.
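As a small, made-up illustration of the cube idea (using pandas rather than an OLAP engine), the sketch below aggregates a sales amount measure along product, region, and month dimensions; a real OLAP system would precompute and index such aggregations at scale.

# Minimal cube-style aggregation sketch: sales amount summed along three
# dimensions (product, region, month). The data is made up.
import pandas as pd

sales = pd.DataFrame({
    "product": ["laptop", "laptop", "phone", "phone"],
    "region":  ["north",  "south",  "north", "south"],
    "month":   ["2024-01", "2024-01", "2024-02", "2024-02"],
    "amount":  [1200, 950, 600, 720],
})

# One face of the cube: product x region on the rows, month on the columns.
cube = pd.pivot_table(
    sales,
    values="amount",
    index=["product", "region"],
    columns="month",
    aggfunc="sum",
    fill_value=0,
)
print(cube)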