DM Lecture 5

Data engineering involves designing and building systems for data aggregation, storage, and analysis, enabling organizations to derive insights from large datasets. Key responsibilities include data collection, real-time analysis, and machine learning support, with a focus on creating efficient data pipelines and ensuring data quality. The process encompasses data ingestion, transformation, and serving, ultimately supporting data scientists and analysts in their work.

Lecture 2

Data Engineering

2025 Dr. Ibrahim Al-Baltah
What is data engineering?
Data engineering is the practice of designing
and building systems for the aggregation,
storage and analysis of data at scale. Data
engineers empower organizations to get
insights in real time from large datasets.
From social media and marketing metrics to
employee performance statistics and trend
forecasts, enterprises have all the data they
need to compile a holistic view of their
operations.
Data engineers transform massive quantities
of data into valuable strategic findings.
What is data engineering…
 Data engineers govern data management for
downstream use including analysis,
forecasting or machine learning.
As specialized computer scientists, data
engineers excel at creating and deploying
algorithms, data pipelines and workflows that
sort raw data into ready-to-use datasets.
Data engineering is an integral component of
the modern data platform and makes it
possible for businesses to analyze and apply
the data they receive, regardless of the data
source or format.
Data engineering use cases
Data engineers have a range of day-to-
day responsibilities. Here are several key
use cases for data engineering:
Data collection, storage and
management
◦ Data engineers streamline data intake and
storage across an organization for convenient
access and analysis. This facilitates scalability
by storing data efficiently and establishing
processes to manage it in a way that is easy
to maintain as a business grows. The field of
DataOps automates data management and is
made possible by the work of data engineers.
Data engineering use cases…
Real-time data analysis
◦ With the right data pipelines in place,
businesses can automate the processes of
collecting, cleaning and formatting data for use
in data analytics. When vast quantities of
usable data are accessible from one location,
data analysts can easily find the information
they need to help business leaders learn and
make key strategic decisions.
◦ The solutions that data engineers create set
the stage for real-time learning as data flows
into data models that serve as living
representations of an organization's status at
any given moment.
Data engineering use cases…
 Machine learning
 Machine learning (ML) uses vast reams of
data to train artificial intelligence (AI)
models and improve their accuracy. From
the product recommendation services
seen in many e-commerce platforms to
the fast-growing field of generative AI
(gen AI), ML algorithms are in widespread
use. Machine learning engineers rely on
data pipelines to transport data from the
point at which it is collected to the
models that consume it for training.
How does data engineering
work?
 Data engineering governs the design and
creation of the data pipelines that convert
raw, unstructured data into unified datasets
that preserve data quality and reliability.
 Data pipelines form the backbone of a well-
functioning data infrastructure and are
informed by the data architecture
requirements of the business they serve.
 Data observability is the practice by which
data engineers monitor their pipelines to
ensure that end users receive reliable data.
 The data integration pipeline contains three
key phases:
1. Data ingestion
2. Data transformation
3. Data serving
Data ingestion
 Data ingestion is the movement of data from various
sources into a single ecosystem. These sources can
include databases, cloud computing platforms such
as Amazon Web Services (AWS), IoT devices, data
lakes and warehouses, websites and other customer
touchpoints.
 Data engineers use APIs to connect many of these
data points into their pipelines.
 Each data source stores and formats data in a
specific way, which may be structured or
unstructured. While structured data is already
formatted for efficient access, unstructured data is
not.
 Through data ingestion, the data is unified into an
organized data system ready for further refinement.
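As a toy illustration of this idea (not from the slides; the source strings and field names are invented), the sketch below ingests two sources that store data in different formats, a CSV export and a JSON API response, and unifies them into one organized structure:

```python
import csv
import io
import json

def ingest_csv(text):
    """Parse a CSV export from one source into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def ingest_json(text):
    """Parse a JSON API response from another source."""
    return json.loads(text)

# Two sources, each formatting its data in its own way.
csv_source = "id,name\n1,Alice\n2,Bob\n"
json_source = '[{"id": "3", "name": "Carol"}]'

# Ingestion unifies both into one organized data system.
unified = ingest_csv(csv_source) + ingest_json(json_source)
print([row["name"] for row in unified])  # ['Alice', 'Bob', 'Carol']
```

In a real pipeline the same unification step would sit behind API connectors rather than in-memory strings.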
Data transformation
Data transformation prepares the
ingested data for end users such
as executives or machine
learning engineers. It is a hygiene
exercise that finds and corrects
errors, removes duplicate entries
and normalizes data for greater
data reliability. Then, the data is
converted into the format
required by the end user.
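A minimal sketch of this hygiene exercise (invented records and field names; Python used for illustration), deduplicating entries and normalizing a field before handing it to the end user:

```python
def transform(records):
    """Deduplicate, normalize casing/whitespace, and emit a clean format."""
    seen = set()
    cleaned = []
    for rec in records:
        key = rec["email"].strip().lower()  # normalize before comparing
        if key in seen:
            continue                        # drop duplicate entries
        seen.add(key)
        cleaned.append({"email": key, "signup": rec["signup"]})
    return cleaned

raw = [
    {"email": "A@x.com ", "signup": "2024-01-05"},
    {"email": "a@x.com", "signup": "2024-01-05"},  # duplicate after normalizing
    {"email": "b@y.com", "signup": "2024-02-10"},
]
print(transform(raw))
```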
Data serving
Once the data has been collected
and processed, it’s delivered to
the end user.
Real-time data modeling and
visualization, machine learning
datasets and automated
reporting systems are all
examples of common data
serving methods.
Data engineering, data analysis and data
science

Data scientists
◦ Use machine learning, data
exploration and other academic
fields to predict future outcomes.
Data science is an interdisciplinary
field focused on making accurate
predictions through algorithms and
statistical models. Like data
engineering, data science is a code-
heavy role requiring an extensive
programming background.
Data engineering, data analysis
and data science…
Data analysts
◦ Examine large datasets to identify
trends and extract insights to help
organizations make data-driven
decisions today. While data scientists
apply advanced computational
techniques to manipulate data, data
analysts work with predefined
datasets to uncover critical
information and draw meaningful
conclusions.
Data engineering, data analysis
and data science…
Data engineers
◦ Data engineers are software
engineers who build and maintain an
enterprise’s data infrastructure,
automating data integration, creating
efficient data storage models and
enhancing data quality via pipeline
observability. Data scientists and
analysts rely on data engineers to
provide them with the reliable, high-
quality data they need for their work.
Data pipelines: ETL vs. ELT
 When building a pipeline, a data engineer automates
the data integration process with scripts—lines of code
that perform repetitive tasks. Depending on their
organization's needs, data engineers construct
pipelines in one of two formats: ETL or ELT.
 ETL: extract, transform, load.
◦ ETL pipelines automate the retrieval and storage of
data in a database. The raw data is extracted from
the source, transformed into a standardized format
by scripts and then loaded into a storage
destination. ETL is the most commonly used data
integration method, especially when combining data
from multiple sources into a unified format.
 ELT: extract, load, transform.
 ELT pipelines extract raw data and import it into a
centralized repository before standardizing it
through transformation. The collected data can
then be transformed as needed inside the
repository.
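The difference between the two formats is purely the order of the load and transform steps, which can be sketched with toy functions (not a real pipeline framework):

```python
def extract(source):
    """Pull raw rows from a source system."""
    return list(source)

def transform(rows):
    """Standardize the rows (here: just uppercase them)."""
    return [r.upper() for r in rows]

def load(rows, store):
    """Write rows into the storage destination."""
    store.extend(rows)
    return store

source = ["alpha", "beta"]

# ETL: transform before loading into the destination.
etl_store = load(transform(extract(source)), [])

# ELT: load the raw data first, transform later in the repository.
elt_store = transform(load(extract(source), []))

print(etl_store, elt_store)  # ['ALPHA', 'BETA'] ['ALPHA', 'BETA']
```

Both orderings end with the same standardized data; what differs is where the raw data lives in the meantime.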
Data engineering tasks for machine
learning
 Data Sourcing
◦ Data sourcing is the determination and
gathering of data from various sources. Even
before that, one must identify the various
sources and how often new data is generated.
The data sources can involve various data
types including, but not limited to, text, video,
audio and pictures.
 Data Exploration
◦ Data exploration is all about getting to know a
dataset. It involves understanding the
distribution of each data point, understanding
correlations between data points, looking at
various patterns, identifying whether missing
values exist, and so on.
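A minimal sketch of exploring a single column (invented sensor readings; the choice of summary fields is illustrative, not prescriptive):

```python
from statistics import mean, stdev

def explore(column):
    """Summarize one column: distribution stats plus a missing-value count."""
    missing = sum(v is None for v in column)
    known = [v for v in column if v is not None]
    return {
        "count": len(known),
        "missing": missing,
        "min": min(known),
        "max": max(known),
        "mean": round(mean(known), 2),
        "stdev": round(stdev(known), 2),
    }

ages = [34, 41, None, 29, 38]  # None marks a missing value
print(explore(ages))
```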
Data engineering tasks for machine
learning…
Data Cleaning
◦ Data cleaning involves correcting errors and
outliers in data points that may otherwise
skew the dataset and introduce bias. Data
cleaning also handles missing values and
data duplication.
◦ Clean data leads to accurate analysis and
accurate regression models, and makes
downstream judgments more reliable.
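One simple cleaning recipe (a sketch, not the slides' method): impute missing values with the median, then drop outliers above an assumed sensor limit. The `upper_bound` threshold and the readings are invented for the example:

```python
from statistics import median

def clean(values, upper_bound=100):
    """Impute missing readings with the median, then drop outliers
    above an assumed upper bound (the threshold is made up here)."""
    known = [v for v in values if v is not None]
    mid = median(known)
    filled = [mid if v is None else v for v in values]
    return [v for v in filled if v <= upper_bound]

readings = [10, 12, None, 11, 400]  # None = missing, 400 = outlier
print(clean(readings))  # [10, 12, 11.5, 11]
```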
Data engineering tasks for machine
learning…
 Data Wrangling
◦ The most common data wrangling technique is
often referred to as data standardization, where
the data is transformed into a range of values
that can be measured on a common scale.
◦ Other data wrangling tools include feature scaling
and variable encoding where data is normalized
and categorical variables are converted into
numerical values in a specific format that can be
easily analyzed by a mathematical model.
◦ While every step in this model building process is
equally important, data wrangling directly deals
with improving the accuracy of machine learning
models.
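The two techniques named above, standardization and variable encoding, can be sketched in a few lines (toy inputs; real pipelines would use a library implementation):

```python
from statistics import mean, pstdev

def standardize(xs):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def encode(categories):
    """Map categorical labels to integer codes a model can consume."""
    codes = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [codes[c] for c in categories]

z = standardize([2, 4, 6])
print(z)                               # roughly [-1.22, 0.0, 1.22]
print(encode(["red", "blue", "red"]))  # [1, 0, 1]
```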
Data engineering tasks for machine
learning…
 Data Integration
 Data integration combines data from several
sources into one coherent dataset. It also
involves identifying and removing duplicates
that may arise from combining those sources.
 Data integration involves schema alignment, where
data from one or more sources are compatible and
aligned.
 Data integration also deals with data quality, where
one would develop validations and checks to ensure
data quality is good after the data is integrated
from several sources.
 Data integration helps expand the scope of the
entire project. Data integration also helps improve
accuracy and integrity of the dataset.
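A toy sketch of schema alignment plus deduplication (the source records, field names, and mappings are all invented): each source's fields are renamed to a shared schema, then duplicates introduced by the merge are dropped.

```python
def align(record, mapping):
    """Rename source-specific fields to the shared schema."""
    return {mapping.get(k, k): v for k, v in record.items()}

crm    = [{"cust_id": 1, "full_name": "Alice"}]
orders = [{"customer": 1, "name": "Alice"}, {"customer": 2, "name": "Bob"}]

crm_schema    = {"cust_id": "id", "full_name": "name"}
orders_schema = {"customer": "id"}

merged = [align(r, crm_schema) for r in crm] + \
         [align(r, orders_schema) for r in orders]

# Remove duplicates introduced by combining sources (same id + name).
unique = {(r["id"], r["name"]): r for r in merged}.values()
print(sorted(r["id"] for r in unique))  # [1, 2]
```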
Data engineering tasks for machine
learning…
 Feature Engineering
◦ In addition to the existing data points in the
dataset, feature engineering involves creating
new data points based on the existing data
points.
◦ Some of the simplest derived data points are
calculating the age of a person from their date
of birth, converting Celsius to Fahrenheit, or
calculating the distance between two points
under a specific distance measure, to name a
few.
◦ The idea is to come up with new data points
that have higher correlation with the data point
that the model is looking to predict or classify.
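The three examples listed above can each be written as a small feature-deriving function (the reference date is an arbitrary choice for the sketch):

```python
from datetime import date
from math import hypot

def age_in_years(born, on=date(2025, 1, 1)):
    """Derive an age feature from a raw date-of-birth column."""
    return on.year - born.year - ((on.month, on.day) < (born.month, born.day))

def fahrenheit(celsius):
    """Derive a Fahrenheit feature from a Celsius reading."""
    return celsius * 9 / 5 + 32

def euclidean(p, q):
    """Derive a distance feature from two coordinate columns."""
    return hypot(p[0] - q[0], p[1] - q[1])

print(age_in_years(date(2000, 6, 15)))  # 24
print(fahrenheit(25))                   # 77.0
print(euclidean((0, 0), (3, 4)))        # 5.0
```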
Data engineering tasks for machine
learning…
Feature Selection
◦ It is time to select which of these data points
enter into the model. There are many ways of
selecting the features for a given machine
learning model.
◦ Tests like the chi-square test help determine
the relationship between categorical
variables, while metrics like the Fisher score
help generate ranks for various data points.
◦ There are approaches like forward selection or
backward selection that iteratively generate
various feature combinations and respective
accuracy scores.
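The ranking idea can be sketched with a simple correlation-based filter, used here as a stand-in for scores like the Fisher score (toy data; feature "a" tracks the target, "b" is noise):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

features = {
    "a": [1, 2, 3, 4],
    "b": [5, 1, 4, 2],
}
target = [2, 4, 6, 8]

# Rank features by |correlation| with the target, strongest first.
ranked = sorted(features, key=lambda f: -abs(pearson(features[f], target)))
print(ranked)  # ['a', 'b']
```

Forward or backward selection would instead add or remove features iteratively and re-score the model each time.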
Data engineering tasks for machine
learning…
 Data Splitting
◦ Data splitting is the process of partitioning the entire
dataset into smaller chunks for training a machine
learning model. Depending upon the training strategy,
the approach of splitting the data may differ.
◦ The most common approach of splitting the data is the
idea of holding out training, validation, and test datasets.
◦ The training set is usually allotted for the model to learn
from the data, the validation set is primarily designed for
hyperparameter tuning purposes, and the test set is to
evaluate the performance of the model with chosen
hyperparameters.
◦ There is also k-fold cross-validation, where the dataset is
partitioned into k chunks and the model is iteratively
trained by keeping k – 1 chunks for training and the
remaining chunk as the test set.
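Both strategies can be sketched in a few lines (the 60/20/20 split ratios and the toy dataset are arbitrary choices for the example):

```python
import random

def holdout_split(rows, train=0.6, val=0.2, seed=0):
    """Shuffle, then cut into train / validation / test partitions."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n = len(rows)
    a, b = int(n * train), int(n * (train + val))
    return rows[:a], rows[a:b], rows[b:]

def k_folds(rows, k=5):
    """Yield (train, test) pairs: each fold serves as the test set once."""
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, test

data = list(range(10))
tr, va, te = holdout_split(data)
print(len(tr), len(va), len(te))  # 6 2 2
for train, test in k_folds(data, k=5):
    assert len(train) == 8 and len(test) == 2
```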
Data engineering tasks for machine
learning…
 Model Selection
◦ The next step is to start selecting the right
machine learning model. This is also a good stage to
decide how important it is to explain the target
variable in terms of the variables entering the model.
◦ In the case of predicting customer churn, it may be
relevant to tell a story based on a few variables
such as demographics and preferences, whereas in
image classification one is more focused on
whether the picture is a cat or a dog and less
concerned with how the picture came to be one or
the other.
◦ In cases of supervised learning, one would usually
start with logistic regression as a baseline and
move toward ensemble methods like bootstrap
aggregation (bagging) or stacking.
Data engineering tasks for machine
learning…
 Model Training
◦ Based on the characteristics of data and nature and
context of the problem we are trying to solve using
the machine learning model, the training strategy and
algorithm vary. The common training strategies are
supervised and unsupervised learning.
◦ Supervised learning is the process of training a model
with a labeled dataset, where each data point carries
a known label; the features with higher correlation to
the target variable are usually the ones picked for
training.
◦ Unsupervised learning is the process of training with
an unlabeled dataset. A good example is the
clustering algorithm that groups similar data points
based on features like geographical proximity and so
on.
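The two strategies can be contrasted with deliberately tiny 1-D examples (both algorithms are simplistic stand-ins invented for the sketch, not standard library methods): the supervised learner uses labels to place a decision threshold, while the unsupervised one groups unlabeled points by proximity alone.

```python
def train_supervised(points, labels):
    """Learn a 1-D decision threshold from labeled data (midpoint rule)."""
    pos = [p for p, l in zip(points, labels) if l == 1]
    neg = [p for p, l in zip(points, labels) if l == 0]
    return (min(pos) + max(neg)) / 2

def cluster_unsupervised(points, gap=5):
    """Group unlabeled points: start a new cluster when the gap is large."""
    sp = sorted(points)
    clusters, current = [], [sp[0]]
    for p in sp[1:]:
        if p - current[-1] > gap:
            clusters.append(current)
            current = [p]
        else:
            current.append(p)
    clusters.append(current)
    return clusters

print(train_supervised([1, 2, 8, 9], [0, 0, 1, 1]))  # 5.0
print(cluster_unsupervised([1, 2, 20, 21]))          # [[1, 2], [20, 21]]
```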
Data engineering tasks for machine
learning…
Model Evaluation
◦ Once we have a machine learning model
trained, the next step is model validation.
◦ Sometimes testing alone may suffice, and
in other cases testing and validation are
required.
◦ The idea is to evaluate the accuracy of a
model by having it make decisions on a
completely new set of data and evaluating
its accuracy statistically through metrics
such as R-squared, F1-score, precision and
recall.
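Three of the metrics named above can be computed from scratch given binary labels (the actual/predicted vectors below are invented):

```python
def precision_recall_f1(actual, predicted):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(a == p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

actual    = [1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
p, r, f = precision_recall_f1(actual, predicted)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```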
Data engineering tasks for machine
learning…
 Hyperparameter Tuning
 Hyperparameter tuning is the process of
optimizing the configuration settings
(hyperparameters) of a machine learning model
to improve its performance on a specific task.
 Hyperparameters are parameters set before the
learning process begins, unlike model parameters
(like weights) that the model learns during
training. Examples include the learning rate, the
strength of regularization, and the depth or
number of trees in an ensemble.
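At its simplest, tuning is a grid search that scores each candidate hyperparameter on the validation set and keeps the best; the sketch below uses an invented one-parameter toy model (y roughly equals param times x) and made-up validation data:

```python
def mse(model_param, data):
    """Validation error of a toy model y = model_param * x."""
    return sum((y - model_param * x) ** 2 for x, y in data) / len(data)

validation = [(1, 2.1), (2, 3.9), (3, 6.2)]

# Grid search: try each candidate hyperparameter, keep the best.
grid = [0.5, 1.0, 1.5, 2.0, 2.5]
best = min(grid, key=lambda p: mse(p, validation))
print(best)  # 2.0
```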
Data engineering tasks for machine
learning…
Final Testing
◦ Final testing is basically testing the final
chosen model on a held-out test set. This
process exists in order to obtain realistic
expectations when the model is deployed in
production.
◦ The main goal behind this step is to get an
idea of how this model may perform when
exposed to a new set of data.
◦ Once the final model has been tested, establish an
understanding of its strengths and weaknesses and
document the results for future reference and for
maintaining audit trails.
Data engineering tasks for machine
learning…
Model Deployment
◦ Model deployment is the process of making a
trained machine learning model available to
users or other systems in a production
environment so it can make predictions on
new data.
Model Monitoring
◦ Once the model is deployed, it needs to be
monitored at regular intervals to check that it
is still performing appropriately. The model also
requires maintenance every so often.
Data engineering tasks for machine
learning…
Model Retraining
◦ Model retraining is a crucial step in the
machine learning lifecycle. It involves sourcing
the new data collected since the last training
run and applying the same data wrangling
and feature engineering processes.
◦ Based on the dataset that is generated,
it will then be decided as to whether
incremental training may suffice or a
complete model overhaul may be
required.
Resources
https://www.ibm.com/think/topics/data-engineering
Pavan Kumar Narayanan, 2024, Data
Engineering for Machine Learning Pipelines:
From Python Libraries to ML Pipelines and
Cloud Platforms, Apress.