CH1 - Introduction To Data Engineering

The document discusses the role of data engineering and what data engineers do. It covers topics like ETL processes, data extraction, transformation and loading, different data storage options including relational databases, non-relational databases and data lakes.

Uploaded by

إيمان محمد


Chapter 1

Introduction to Data Engineering

© Ibrahim Abu alhaol, Bushra Alhijawi

What is Data Engineering
Data engineering is the process of designing, building, and maintaining the
infrastructure and systems that collect, store, and process data.


Data Engineering Today
• Data engineering → a critical component of data science and analytics.
• Provides the foundation for data-driven decision-making.
• With the growing amount of data that businesses generate, data engineering has
become an increasingly important field.

Unit        Abbr   Size (decimal, bytes)
Kilobyte    kB     10^3
Megabyte    MB     10^6
Gigabyte    GB     10^9
Terabyte    TB     10^12
Petabyte    PB     10^15
Exabyte     EB     10^18
Zettabyte   ZB     10^21
Yottabyte   YB     10^24
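As a small illustration of the decimal units listed above, the following Python sketch converts a raw byte count into the largest unit that fits. The function name `human_readable` and the sample values are assumptions made for this example only.

```python
# Decimal (SI) byte units, each 1000 times larger than the previous one.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest decimal unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1000,
        # or at the last unit if the number is larger than anything in the table.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

print(human_readable(10**12))      # 1.0 TB
print(human_readable(2_500_000))   # 2.5 MB
```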

Data Engineering Today
Data engineers → database administrators (DBAs), data analysts, business
intelligence engineers, and database developers.
The growth in data production has led to major advancements in data
technology and data jobs, especially in the fields of data engineering and data
science.

Today, data engineering is about the movement, manipulation, and management
of data.

Data Engineering Trend
• Topics like machine learning and AI will always win a popularity contest, especially
as they continue to appear in the mainstream media.
• A good chunk of the work behind these concepts stems from data engineering
work, such as data mining, data manipulation and cleaning, string manipulation,
aggregation, joining to other data sets, and building deployable machine learning
pipelines.


Data Science vs Data Engineering
• Data Science is the field of study that
combines domain expertise,
programming skills, and knowledge of
mathematics and statistics to extract
meaningful insights from data.

• Data engineering deals with various data formats, storage, data extraction,
and transformation.


Data Scientist vs Data Engineer
What are the differences between Data scientists and Data engineers?

Data Scientist
• More mathematical.
• Use data to build statistical models and perform mathematical computation.
• Their models are incorporated into a data engineering pipeline.
• Connect to the data warehouses built by data engineers.

Data Engineer
• More technical, with solid data warehousing and programming backgrounds.
• Need to understand data formats, models, and structures to transport data efficiently.
• Build the data engineering pipeline.

What Data Engineers Do
• ETL (extract, transform, and load) systems.
• Data engineering involves the movement of data from one system or format
to another system or format.
• Data engineers query data from a source (EXTRACT), perform some
modifications to the data (TRANSFORM) and then put that data in a location
where users can access it and know that it is production quality (LOAD).
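The three ETL steps can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production design: the sample CSV text, the `sales` table, and the function names are all assumptions made for the example, and an in-memory SQLite database stands in for the target location.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real source could be a file, an API, or a database.
SOURCE_CSV = """name,city,amount
alice,amman,10.5
bob,irbid,7
alice,amman,10.5
"""

def extract(text):
    """EXTRACT: read rows from the source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """TRANSFORM: drop exact duplicates, tidy the names, convert amounts to numbers."""
    seen, cleaned = set(), []
    for row in rows:
        key = (row["name"], row["city"], row["amount"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((row["name"].title(), row["city"].title(), float(row["amount"])))
    return cleaned

def load(rows, conn):
    """LOAD: store the cleaned rows where users can query them."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute("SELECT name, city, amount FROM sales ORDER BY name").fetchall()
print(result)  # [('Alice', 'Amman', 10.5), ('Bob', 'Irbid', 7.0)]
```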


Data Extraction
• Extract Data → Getting Data
• Data is copied or exported from source locations to a staging area.
• The data can come from any structured or unstructured source—SQL or NoSQL
servers, CRM and ERP systems, text and document files, emails, web pages, and
more.
• Data engineers dedicate a lot of their time to pulling data sets from an array of
sources, most of the time into a central hub, a data warehouse, to be utilized
together.
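A rough sketch of extraction into a staging area, assuming two hypothetical sources (a CRM export in CSV and a web log in JSON). The data, the `source` tag, and the use of a plain Python list as the staging area are illustrative assumptions; records are copied unchanged, since cleaning belongs to the transformation step.

```python
import csv
import io
import json

# Hypothetical exports from two different source systems.
CRM_CSV = "customer_id,email\n1,a@example.com\n2,b@example.com\n"
WEB_JSON = '[{"customer_id": 2, "page": "/pricing"}, {"customer_id": 3, "page": "/docs"}]'

def extract_csv(text, source):
    """Copy CSV rows as-is, tagging each record with the system it came from."""
    return [{"source": source, **row} for row in csv.DictReader(io.StringIO(text))]

def extract_json(text, source):
    """Copy JSON records as-is, tagging each record with the system it came from."""
    return [{"source": source, **record} for record in json.loads(text)]

# The staging area is just a list here; in practice it might be files or staging tables.
staging = extract_csv(CRM_CSV, "crm") + extract_json(WEB_JSON, "web")
print(len(staging), "records staged")
```

Note that the CSV source yields `customer_id` as text while the JSON source yields a number; reconciling such differences is left to the transformation step.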


Data Transformation
• Roughly 80% of data science work is estimated to be data preparation, and
about 75% of data scientists reportedly find this the most boring aspect of the job.
• Raw data is transformed to be useful for analysis and to fit the
schema of the eventual target data warehouse.
• Data engineers bring their skills in manipulating data to a project, such
as:
• Filtering, cleansing, de-duplicating, validating and authenticating the data.
• Performing calculations, translations, or summaries based on the raw data.
• Formatting the data into tables or joined tables to match the schema of the
target data warehouse.
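The transformation tasks listed above (de-duplicating, validating, summarizing) can be illustrated with a short Python sketch. The records, the validation rule (amounts must be present and non-negative), and the function names are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical raw records; None and negative amounts stand in for bad data.
raw = [
    {"order_id": 1, "country": "JO", "amount": 20.0},
    {"order_id": 1, "country": "JO", "amount": 20.0},   # duplicate
    {"order_id": 2, "country": "JO", "amount": None},   # missing amount
    {"order_id": 3, "country": "AE", "amount": -5.0},   # fails validation
    {"order_id": 4, "country": "AE", "amount": 12.5},
]

def clean(records):
    """De-duplicate on order_id and keep only rows with a non-negative amount."""
    seen, kept = set(), []
    for rec in records:
        if rec["order_id"] in seen:
            continue
        seen.add(rec["order_id"])
        if rec["amount"] is not None and rec["amount"] >= 0:
            kept.append(rec)
    return kept

def total_by_country(records):
    """Summarize the cleaned rows: total amount per country."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["country"]] += rec["amount"]
    return dict(totals)

cleaned = clean(raw)
summary = total_by_country(cleaned)
print(summary)  # {'JO': 20.0, 'AE': 12.5}
```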


Data Loading
• Load is one of the more basic data engineering job functions because it is just the
movement and storage of data (storing data in the target location).
• Data engineers sometimes swap the transform and load steps (ELT) when dealing
with big data technologies such as Hadoop/Spark, because transforming the data
after it is loaded is often cheaper to run and spreads the processing burden across
multiple machines (a cluster).
• Data engineers have the experience to choose from many available data storage
options in the cloud and on-premises, including NoSQL databases, relational data
stores, data warehouses, and data lakes.
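The ELT ordering can be sketched as follows: raw values are loaded first, and the transformation runs afterwards inside the data store using its own query engine. An in-memory SQLite database stands in for the big data platform here, and the table names and sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# LOAD first: raw values go into the store unchanged, as text.
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("u1", "10"), ("u1", "5.5"), ("u2", "3"), ("u2", "not-a-number")],
)

# TRANSFORM afterwards, inside the store, using its own SQL engine.
# The GLOB filter keeps only values that start with a digit.
conn.execute(
    """
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS total
    FROM raw_events
    WHERE amount GLOB '[0-9]*'
    GROUP BY user_id
    """
)
totals = conn.execute("SELECT user_id, total FROM user_totals ORDER BY user_id").fetchall()
print(totals)  # [('u1', 15.5), ('u2', 3.0)]
```

Because the raw table is kept, the transformation can be rewritten and rerun later without extracting the data again.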


What Data Engineers Do
• Data storage.
• Data engineers are involved in selecting data storage from several available
cloud-based and on-premises options.
• Relational and non-relational databases. Databases rank among the most
common solutions for data storage. Data engineers should be familiar with
relational and non-relational databases and how they work.
• Data lake is a centralized repository that allows for storing structured and
unstructured data at any scale.
• Cloud computing → separates storage and computational machines,
meaning you can scale down (switch off) expensive machines used to
process data without affecting the stored data.


Relational Databases
A relational database is a collection of data items with pre-defined
relationships between them.
• These items are organized as a set of tables with columns and rows.
• Tables are used to hold information about objects.
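A minimal sketch of these ideas using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are hypothetical; the foreign key is one common way to declare a pre-defined relationship between tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    )
    """
)
conn.execute("INSERT INTO customers VALUES (1, 'Lina'), (2, 'Omar')")
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 8.0)")

# The pre-defined relationship lets us join the two tables.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.total)
    FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
    """
).fetchall()
print(rows)  # [('Lina', 30.0), ('Omar', 8.0)]

# A row that points at a missing customer is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 1.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```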


Non-relational Databases
A non-relational database is a database that does not use the tabular
schema of rows and columns. NoSQL stands for "not only SQL". Google’s
Bigtable is an example of a NoSQL data store built on the Google File
System (GFS).
• Documents
• Semi-structured data
• Large and unstructured data → Results from the internet of things
(IoT), social networks, and the rise of AI.
• NoSQL databases have become popular because distributed
technologies and file systems (including Hadoop/Spark) became
more accessible for storing petabytes of data.
• A distributed system runs on a network of machines called a cluster, and
each machine is referred to as a node.
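To make the terms concrete, here is a toy sketch of a schema-less key-value/document store spread over a cluster of nodes. It is an illustration under stated assumptions, not how any particular NoSQL product works: the node names, the `put`/`get` helpers, and the hash-modulo placement rule are all hypothetical, and real systems add replication and more elaborate partitioning.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster of three machines

def node_for(key: str) -> str:
    """Pick a node by hashing the key, so the same key always maps to the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Each node holds its own key -> document mapping; documents need not share fields.
cluster = {node: {} for node in NODES}

def put(key, document):
    """Store a document on the node responsible for its key."""
    cluster[node_for(key)][key] = document

def get(key):
    """Fetch a document from the node responsible for its key, or None if absent."""
    return cluster[node_for(key)].get(key)

put("sensor:17", {"type": "temperature", "celsius": 21.4})
put("user:42", {"name": "Sara", "follows": ["user:7", "user:9"]})
print(get("user:42"))
```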


Relational vs Non-relational Databases

Relational Database
• Use a structured schema to organize data into tables with rows and columns.
• Use SQL (Structured Query Language) to access and manipulate the data.
• Relationships between tables are established using foreign keys.
• Ex., MySQL and PostgreSQL.

Non-relational Database
• Use a more flexible, schema-less structure to store data in documents or key-value pairs.
• Use their own query languages.
• Do not rely on foreign keys to establish relationships between data.
• Used for storing large amounts of unstructured data.
• Ex., MongoDB and Cassandra.

Data Warehouse
• A data warehouse is a large, centralized repository of data designed to support
business intelligence activities, such as reporting and data analysis.

• Data is extracted from various transactional systems, transformed to fit a
common data model, and then loaded into the warehouse for reporting and
analysis.

• Data warehousing technologies can be implemented with relational databases,
columnar databases, or cloud-based services like Amazon Redshift, Snowflake, and
Google BigQuery.

Data Lake
• A data lake is a centralized, shared repository that allows storing all structured and
unstructured data at any scale.

• Designed to handle large amounts of data, including raw, detailed, and historical
data, and to support a wide variety of data processing and analysis tasks, such as
batch processing, real-time streaming, interactive querying, dashboards, and
visualizations of data.
• Stores data in its native format, WITHOUT having to fit a data model.

• Data lake technologies can be implemented on-premises or on cloud-based
services like AWS S3, Azure Data Lake Storage, and Google Cloud Storage.

Data Warehouse vs Data Lake
Characteristic: Data
Data Warehouse: Relational, from transactional systems, operational databases, and line of business applications
Data Lake: Non-relational and relational, from IoT devices, web sites, mobile apps, social media, and corporate applications

Characteristic: Schema
Data Warehouse: Designed prior to the implementation (schema-on-write)
Data Lake: Written at the time of analysis (schema-on-read)

Characteristic: Price/Performance
Data Warehouse: Fastest query results using higher cost storage
Data Lake: Query results getting faster using low-cost storage

Characteristic: Data Quality
Data Warehouse: Highly curated data that serves as the central version of the truth
Data Lake: Any data that may or may not be curated (i.e., raw data)

Characteristic: Users
Data Warehouse: Business analysts
Data Lake: Data scientists, data developers, and business analysts (using curated data)

Characteristic: Analytics
Data Warehouse: Batch reporting, BI, and visualizations
Data Lake: Machine learning, predictive analytics, data discovery and profiling
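The schema row of the comparison can be illustrated with a small Python sketch. An in-memory SQLite table stands in for a warehouse (schema-on-write) and a list of raw JSON strings stands in for a lake (schema-on-read); the device readings and the `read_celsius` helper are hypothetical.

```python
import json
import sqlite3

# Schema-on-write (warehouse style): the table is designed before any data arrives.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE readings (device TEXT NOT NULL, celsius REAL NOT NULL)")
warehouse.execute("INSERT INTO readings VALUES (?, ?)", ("d1", 21.5))

# Schema-on-read (lake style): raw records are kept as they arrived...
lake = [
    '{"device": "d1", "celsius": 21.5}',
    '{"device": "d2", "temp_f": 70.7, "battery": 0.8}',
]

def read_celsius(raw_line):
    """...and a schema is applied only when the data is read for analysis."""
    record = json.loads(raw_line)
    if "celsius" in record:
        return record["device"], record["celsius"]
    # Convert Fahrenheit readings to Celsius at read time.
    return record["device"], round((record["temp_f"] - 32) * 5 / 9, 1)

lake_view = [read_celsius(line) for line in lake]
print(lake_view)  # [('d1', 21.5), ('d2', 21.5)]
```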

What Data Engineers Do
• Building data systems and pipelines.
• The combination of extracting, loading, and transforming data is
accomplished by creating a data pipeline.
• Data pipeline → the design of systems for processing and storing data. These
systems capture, cleanse, transform and route data to destination systems.
Data engineers build data pipelines that enable the organization to collect
data points from millions of users and process the results in near real-time.

Figure: A pipeline that adds a location and modifies the date.

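A pipeline of this kind (one step adds a location, another modifies the date) might look like the following Python sketch. The store lookup table, the date formats, and the function names are assumptions made for the example; the original figure does not specify them.

```python
from datetime import datetime

# Hypothetical lookup used to enrich records with a location.
STORE_LOCATIONS = {"S01": "Amman", "S02": "Irbid"}

def add_location(record):
    """Return a copy of the record with a location looked up from its store id."""
    return {**record, "location": STORE_LOCATIONS.get(record["store_id"], "unknown")}

def normalize_date(record):
    """Return a copy with the date rewritten from DD/MM/YYYY to ISO YYYY-MM-DD."""
    parsed = datetime.strptime(record["date"], "%d/%m/%Y")
    return {**record, "date": parsed.strftime("%Y-%m-%d")}

PIPELINE = [add_location, normalize_date]

def run_pipeline(records, steps=PIPELINE):
    """Pass every record through each step in order."""
    for step in steps:
        records = [step(rec) for rec in records]
    return records

out = run_pipeline([{"store_id": "S01", "date": "03/02/2024", "amount": 9.5}])
print(out)
```

Keeping each step as a small function that returns a new record makes it easy to add, remove, or reorder steps.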
Data Pipeline Optimization
• Each of the ETL steps can be optimized:
• Increasing the speed of data extraction.
• Transformations can be optimized to improve join speeds to other tables and
increase the performance of complex calculations or string manipulations.
• Data loads can be optimized by increasing the speed at which data can be stored in
the data store of choice or improving how the data is stored to be more effective for
the end user.
• These optimizations are specific to the technology and final use case.
• The optimization is NOT a one-time task but an ongoing process that requires
regular monitoring and tuning to maintain performance as the data and
workloads change.
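As one example of a load optimization, the sketch below compares a row-by-row load with a batched load into SQLite. It only shows that both approaches store the same data; whether batching is actually faster depends on the data store and should be measured. The `items` table and helper names are hypothetical.

```python
import sqlite3

rows = [(i, f"item-{i}") for i in range(1000)]  # hypothetical data to load

def load_row_by_row(conn, data):
    """Baseline: one INSERT and one commit per row."""
    for row in data:
        conn.execute("INSERT INTO items VALUES (?, ?)", row)
        conn.commit()

def load_in_batch(conn, data):
    """Candidate optimization: one executemany call inside a single transaction."""
    with conn:  # commits once at the end, or rolls back on error
        conn.executemany("INSERT INTO items VALUES (?, ?)", data)

def fresh_db():
    """Create an empty in-memory database with the target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER, label TEXT)")
    return conn

slow, fast = fresh_db(), fresh_db()
load_row_by_row(slow, rows)
load_in_batch(fast, rows)
count_slow = slow.execute("SELECT COUNT(*) FROM items").fetchone()[0]
count_fast = fast.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(count_slow, count_fast)  # 1000 1000
```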

What Data Engineers Do
• Evaluating business needs and objectives
• To make raw data useful to the organization, data engineers must understand
business objectives.
• Data engineers should understand business requirements and where data fits
into the business model so they can build a data ecosystem that serves the
organization’s needs.

• Building algorithms and prototypes
• Data pipelines represent an automated set of actions that extract data from various
sources for analysis and visualization. These processes are powered by algorithms.

What Data Engineers Do
• Interpreting trends and patterns
• Performing complex data analysis to find trends and patterns and reporting on the results in
the form of dashboards, reports, and data visualizations.

• Preparing data for prescriptive and predictive modeling
• Data engineers must ensure the data is complete (no missing values), has been cleansed, and
that rules have been established for outliers (eliminate, ignore, average out, and so on).

• Developing analytical tools and programs
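Preparing data for modeling, as described above, can be sketched with the standard library. The rules chosen here are assumptions for illustration: missing values are filled with the median, and outliers are eliminated using the common 1.5 × IQR rule. Other rules (ignore, average out) are equally valid choices.

```python
from statistics import median, quantiles

values = [10.0, 12.0, None, 11.0, 95.0, 9.0, None, 13.0]  # hypothetical measurements

def fill_missing(data):
    """Replace missing entries with the median of the observed values."""
    observed = [v for v in data if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in data]

def drop_outliers(data, k=1.5):
    """Eliminate values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(data, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in data if low <= v <= high]

complete = fill_missing(values)
prepared = drop_outliers(complete)
print(prepared)
```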

Data Engineering Skills
Coding. Proficiency in coding languages is essential to this role,
so consider taking courses to learn and practice your skills.
Common languages include SQL, Python, Java, R, and Scala, along
with the query languages of NoSQL databases.

Data Storage. Selecting suitable data storage based on the
business needs. Data engineers should be familiar with relational
and non-relational databases, data warehouses, and data
lakes and how they work.

Data Engineering Skills
Data Pipeline. Data engineers should be able to build efficient
data pipelines and systems.

Machine learning. While machine learning is more the concern
of data scientists, it can be helpful to grasp the basic concepts
to better understand the needs of data scientists on the work
team.

Data is the Backbone of AI and ML
• As data scientists build a model,
they cycle through multiple iterations
of data ingestion and preparation.
• Along the way, the data scientists ask
more questions and require extra data
for answers.
• Data engineers and data scientists
can turn the code written to build
the model into a small service that
can be asked to run some data
against the model or to
rebuild/enhance the model using
new data.

Data Engineering Skills
Big data tools. Data engineers are tasked with managing big
data. Tools and technologies vary by company, but some popular
ones include Hadoop, MongoDB, and Kafka.

Cloud computing. Data engineers must understand cloud
computing.

Data security. Data engineers and security teams are tasked with
securely managing and storing data to protect it from loss or
theft.

Data Engineering Tools
• Programming languages.
• SQL, Java, Python, R.

• Databases.
• Relational DB: MySQL, Oracle, PostgreSQL.
• NoSQL DB: MongoDB, Apache Cassandra, Elasticsearch.

• Data processing engines.
• Data processing engines allow the parallel execution of transformation tasks.
• Apache Spark.

Data Engineering Tools
• Data pipelines.
• Combining a transactional database, a programming language, a processing
engine, and a data warehouse results in a pipeline.
• Data pipelines need a scheduler to allow them to run at specified intervals.
• Apache Airflow & Apache NiFi → workflow management platforms.
• Apache NiFi is designed to handle data in motion and provide data integration.
• Apache Airflow is designed to handle data pipelines and workflow scheduling.
• Both can be used together to achieve a complete data integration and management solution.
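The core idea behind workflow scheduling, running tasks in dependency order, can be sketched with Python's standard `graphlib` module. This is not the Apache Airflow or Apache NiFi API; the task names and the `run_once` helper are hypothetical, and a real scheduler would also trigger the run at specified intervals and handle retries.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks mapped to the tasks they depend on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

log = []
# Each task just records its own name; real tasks would do the actual work.
TASKS = {name: (lambda n=name: log.append(n)) for name in DEPENDENCIES}

def run_once():
    """Run every task once, never starting a task before its dependencies finish."""
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        TASKS[name]()

run_once()
print(log)  # ['extract', 'transform', 'load', 'report']
```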

Any Questions?
www.psut.edu.jo
Call: (+962) 6-5359 949
Fax: (+962) 6-5347 295
Princess Sumaya University for Technology
Amman 11941 Jordan
P.o.Box 1438 Al-Jubaiha
