
CH1 - Introduction To Data Engineering

The document discusses the role of data engineering and what data engineers do. It covers topics like ETL processes, data extraction, transformation and loading, different data storage options including relational databases, non-relational databases and data lakes.

Chapter 1

Introduction to Data Engineering


© Ibrahim Abu alhaol, Bushra Alhijawi


What is Data Engineering
Data engineering is the process of designing, building, and maintaining the
infrastructure and systems that collect, store, and process data.



Data Engineering Today
• Data Engineering → a critical component of data science and analytics.
• Provides the foundation for data-driven decision-making.
• With the growing amount of data businesses generate, data engineering
has become an increasingly important field.

Unit | Abbr | Size (Decimal)
Kilobyte | kB | 10^3
Megabyte | MB | 10^6
Gigabyte | GB | 10^9
Terabyte | TB | 10^12
Petabyte | PB | 10^15
Exabyte | EB | 10^18
Zettabyte | ZB | 10^21
Yottabyte | YB | 10^24
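
The decimal units in the table differ by a factor of 10^3 at each step. The following is a minimal sketch of that arithmetic, assuming the conventional symbols kB, MB, GB and so on; the function name `humanize_bytes` is made up for illustration.

```python
# Decimal (SI) byte units: each step up the list is a factor of 10^3.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def humanize_bytes(n_bytes: int) -> str:
    """Express a byte count in the largest decimal unit that keeps the value >= 1."""
    value = float(n_bytes)
    index = 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return f"{value:.2f} {UNITS[index]}"


print(humanize_bytes(1_500))        # 1.50 kB
print(humanize_bytes(2 * 10**12))   # 2.00 TB
```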


Data Engineering Today
Data engineers → a role that overlaps with database administrators (DBAs), data
analysts, business intelligence engineers, and database developers.
The growth in data production has led to major advancements in data
technology and data jobs, especially in the fields of data engineering and data
science.

Today, data engineering is about the movement, manipulation, and management
of data.



Data Engineering Trend
• Topics like machine learning and AI will always win a popularity contest, especially
as they continue to appear in the mainstream media.
• A good chunk of the work behind these concepts stems from data engineering
work, such as data mining, data manipulation and cleaning, string manipulation,
aggregation, joining to other data sets, and building deployable machine learning
pipelines.



Data Science vs Data Engineering
• Data Science is the field of study that
combines domain expertise,
programming skills, and knowledge of
mathematics and statistics to extract
meaningful insights from data.

• Data engineering deals with various
data formats, storage, data extraction,
and transformation.



Data Scientist vs Data Engineer
What are the differences between Data scientists and Data engineers?

Data Scientist
• More mathematical.
• Utilize data for building statistical models and mathematical computation.
• Their models are incorporated into a data engineering pipeline.
• Connect to the data warehouses built by data engineers.

Data Engineer
• More technical, with solid data warehousing and programming backgrounds.
• Need to understand data formats, models, and structures to transport data efficiently.
• Build the data engineering pipeline.



What Data Engineers Do
• ETL (extract, transform, and load) systems.
• Data engineering involves the movement of data from one system or format
to another system or format.
• Data engineers query data from a source (EXTRACT), perform some
modifications to the data (TRANSFORM) and then put that data in a location
where users can access it and know that it is production quality (LOAD).
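
The three ETL steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the CSV text, the `sales` table, and the cleaning rules are invented for the example, and an in-memory SQLite database stands in for the location where users access the data.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a file, API, or database.
SOURCE_CSV = """name,city,amount
alice,amman,10.5
bob,irbid,
carol,amman,7.25
"""


def extract(text: str) -> list[dict]:
    """EXTRACT: read rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list[dict]) -> list[tuple]:
    """TRANSFORM: drop rows with a missing amount, tidy names, cast types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append((row["name"].title(), row["city"].title(), float(row["amount"])))
    return cleaned


def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """LOAD: write the cleaned rows where users can query them."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 17.75
```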



Data Extraction
• Extract Data → Getting Data
• Data is copied or exported from source locations to a staging area.
• The data can come from any structured or unstructured source—SQL or NoSQL
servers, CRM and ERP systems, text and document files, emails, web pages, and
more.
• Data engineers dedicate a lot of their time to pulling data sets from an array of
sources, most of the time into a central hub, a data warehouse, to be utilized
together.
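
As a rough sketch of extraction into a staging area, the example below copies records from two differently shaped sources without changing their values. The sources (`CRM_EXPORT`, `WEB_EVENTS`) and the in-memory list used as the staging area are assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical sources: a CSV export from a CRM and a JSON dump of web events.
CRM_EXPORT = "id,email\n1,a@example.com\n2,b@example.com\n"
WEB_EVENTS = '[{"id": 1, "page": "/home"}, {"id": 3, "page": "/pricing"}]'


def extract_csv(text: str, source: str) -> list[dict]:
    """Copy CSV rows into staging records without changing their values."""
    return [{"source": source, "payload": dict(row)} for row in csv.DictReader(io.StringIO(text))]


def extract_json(text: str, source: str) -> list[dict]:
    """Copy JSON objects into staging records without changing their values."""
    return [{"source": source, "payload": obj} for obj in json.loads(text)]


# The staging area is just an in-memory list here; transformation happens later.
staging = extract_csv(CRM_EXPORT, "crm") + extract_json(WEB_EVENTS, "web")
print(len(staging))  # 4
```

Tagging each record with its source keeps the origin traceable once the data sets are combined in a central hub.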



Data Transformation
• A commonly cited estimate is that about 80% of data science work is data
preparation, and that about 75% of data scientists find it the most boring
aspect of the job.
• Raw data is transformed to be useful for analysis and to fit the
schema of the eventual target data warehouse.
• Data engineers bring their skills in manipulating data to a project, such
as:
• Filtering, cleansing, de-duplicating, validating and authenticating the data.
• Performing calculations, translations, or summaries based on the raw data.
• Formatting the data into tables or joined tables to match the schema of the
target data warehouse.
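
The transformation tasks listed above (filtering, de-duplicating, validating, summarizing, formatting) can be sketched as follows. The raw rows, the validation rule (no negative amounts), and the target layout `(order_id, country, amount)` are assumptions for this example, not a prescribed schema.

```python
from collections import defaultdict

# Hypothetical raw rows as they might arrive from a staging area.
raw = [
    {"order_id": "1", "country": "jo", "amount": "10.0"},
    {"order_id": "1", "country": "jo", "amount": "10.0"},  # duplicate
    {"order_id": "2", "country": "JO", "amount": "-5"},    # invalid amount
    {"order_id": "3", "country": "ae", "amount": "4.5"},
    {"order_id": "4", "country": "", "amount": "2.0"},     # missing country
]


def transform(rows):
    """Filter, de-duplicate, validate, and format rows for the target schema."""
    seen = set()
    clean = []
    for row in rows:
        if not row["country"]:          # filtering: drop incomplete rows
            continue
        if row["order_id"] in seen:     # de-duplicating on order_id
            continue
        amount = float(row["amount"])
        if amount < 0:                  # validating: negative amounts are rejected
            continue
        seen.add(row["order_id"])
        # formatting to match the (assumed) target schema
        clean.append((int(row["order_id"]), row["country"].upper(), amount))
    return clean


def summarize(rows):
    """Summary based on the cleaned data: total amount per country."""
    totals = defaultdict(float)
    for _, country, amount in rows:
        totals[country] += amount
    return dict(totals)


clean_rows = transform(raw)
print(clean_rows)             # [(1, 'JO', 10.0), (3, 'AE', 4.5)]
print(summarize(clean_rows))  # {'JO': 10.0, 'AE': 4.5}
```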



Data Loading
• Load is one of the more basic data engineering job functions because it is just the
movement and storage of data (storing data in the target location).
• Data engineers sometimes swap the transform and load steps (ELT) when dealing
with big data technologies such as Hadoop/Spark, because transforming the data
inside the target platform is cheaper to run and spreads the processing burden
across multiple machines (a cluster).
• Data engineers have the experience to choose from many available data storage
options in the cloud and on-premises, including NoSQL databases, relational data
stores, data warehouses, and data lakes.
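
To illustrate the ELT ordering, the sketch below loads raw values first and only then transforms them with the target's own engine. In-memory SQLite is used here as a stand-in for a big data platform, which is an assumption of this example; the table names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the target platform (an assumption for this sketch).
conn = sqlite3.connect(":memory:")

# LOAD first: raw values go in as-is, stored as text.
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("u1", "10.5"), ("u2", "n/a"), ("u1", "4.5"), ("u3", "1.0")],
)

# TRANSFORM afterwards, inside the target, using its own query engine.
conn.execute(
    """
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS total
    FROM raw_events
    WHERE amount GLOB '[0-9]*'   -- keep only values that start with a digit
    GROUP BY user_id
    """
)
totals = conn.execute("SELECT user_id, total FROM user_totals ORDER BY user_id").fetchall()
print(totals)  # [('u1', 15.0), ('u3', 1.0)]
```

Because the raw table is kept, the transformation can be rewritten and rerun later without extracting the data again.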



What Data Engineers Do
• Data storage.
• Data engineers are involved in selecting data storage from several available
cloud-based and on-premises options.
• Relational and non-relational databases. Databases rank among the most
common solutions for data storage. Data engineers should be familiar with
relational and non-relational databases and how they work.
• Data lake is a centralized repository that allows for storing structured and
unstructured data at any scale.
• Cloud computing → separates storage and computational machines,
meaning you can scale down (switch off) expensive machines used to
process data without affecting the stored data.



Relational Databases
A relational database is a collection of data items with pre-defined
relationships between them.
• These items are organized as a set of tables with columns and rows.
• Tables are used to hold information about objects.



Non-relational Databases
A non-relational database is a database that does not use the tabular
schema of rows and columns. NoSQL stands for "not only SQL". Google's
Bigtable is an example of a NoSQL data store built on the Google File
System (GFS).
• Documents
• Semi-structured data
• Large and unstructured data → Results from the internet of things
(IoT), social networks, and the rise of AI.
• NoSQL databases have become popular because distributed
technologies and file systems (including Hadoop/Spark) became
more accessible for storing petabytes of data.
• Distributed technology is a network of machines called a cluster, and
each machine is referred to as a node.



Relational vs Non-relational Databases

Relational Database
• Use a structured schema to organize data into tables with rows and columns.
• Use SQL (Structured Query Language) to access and manipulate the data.
• Relationships between tables are established using foreign keys.
• Ex. MySQL and PostgreSQL.

Non-relational Database
• Use a more flexible, schema-less structure to store data in documents or key-value pairs.
• Use their own query languages.
• Do not rely on foreign keys to establish relationships between data.
• Used for storing large amounts of unstructured data.
• Ex. MongoDB and Cassandra.
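
The contrast can be made concrete with a small sketch. The relational half uses SQLite with a foreign key and a SQL join; the document half uses a plain Python dict as a stand-in for a document store, which is a simplification and not the API of MongoDB or Cassandra. All table and field names are hypothetical.

```python
import json
import sqlite3

# Relational: two tables linked by a foreign key, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id), "
    "item TEXT NOT NULL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Lina')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 1, "laptop"), (2, 1, "mouse")])
joined = conn.execute(
    "SELECT c.name, o.item FROM customers c JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Document style: the same information nested in one self-describing record,
# with no foreign key. A plain dict stands in for a document store here.
document = {"_id": 1, "name": "Lina", "orders": [{"item": "laptop"}, {"item": "mouse"}]}
items = [order["item"] for order in document["orders"]]

print(joined)                # [('Lina', 'laptop'), ('Lina', 'mouse')]
print(json.dumps(document))
print(items)                 # ['laptop', 'mouse']
```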



Data Warehouse
• A data warehouse is a large, centralized repository of data designed to support
business intelligence activities, such as reporting and data analysis.

• Data is extracted from various transactional systems, transformed to fit a
common data model, and then loaded into the warehouse for reporting and
analysis.

• Data warehousing technologies can be implemented with relational databases,
columnar databases, or cloud-based services like Amazon Redshift, Snowflake, and
Google BigQuery.



Data Lake
• A data lake is a centralized, shared repository that allows storing all structured and
unstructured data at any scale.

• Designed to handle large amounts of data, including raw, detailed, and historical
data, and to support a wide variety of data processing and analysis tasks, such as
batch processing, real-time streaming, interactive querying, dashboards, and
visualizations of data.

• Stores data in its native format, WITHOUT having to fit a data model.

• Data lake technologies can be implemented on-premises or on cloud-based
services like AWS S3, Azure Data Lake Storage, and Google Cloud Storage.



Data Warehouse vs Data Lake
Characteristics | Data Warehouse | Data Lake
Data | Relational from transactional systems, operational databases, and line of business applications | Non-relational and relational from IoT devices, web sites, mobile apps, social media, and corporate applications
Schema | Designed prior to the implementation (schema-on-write) | Written at the time of analysis (schema-on-read)
Price/Performance | Fastest query results using higher cost storage | Query results getting faster using low-cost storage
Data Quality | Highly curated data that serves as the central version of the truth | Any data that may or may not be curated (i.e., raw data)
Users | Business analysts | Data scientists, data developers, and business analysts (using curated data)
Analytics | Batch reporting, BI, and visualizations | Machine learning, predictive analytics, data discovery, and profiling
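
The schema row of the comparison can be illustrated with a short sketch. Under the assumptions of this example, an SQLite table with constraints plays the warehouse (schema-on-write), and a list of raw JSON strings plays the lake (schema-on-read); the sensor fields are invented.

```python
import json
import sqlite3

# Schema-on-write (warehouse style): the table is designed before any data arrives,
# and rows that do not fit are rejected at load time.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE readings (sensor TEXT NOT NULL, celsius REAL NOT NULL)")
rejected = 0
for row in [("s1", 21.5), ("s2", None)]:
    try:
        warehouse.execute("INSERT INTO readings VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        rejected += 1

# Schema-on-read (lake style): raw records are stored untouched; structure is
# applied only when an analysis reads them.
lake = ['{"sensor": "s1", "celsius": 21.5}', '{"sensor": "s2", "temp_f": 70.0}']


def read_celsius(raw_line: str):
    """Interpret one raw record at read time, whatever shape it has."""
    record = json.loads(raw_line)
    if "celsius" in record:
        return record["celsius"]
    if "temp_f" in record:  # a differently shaped record, converted on read
        return round((record["temp_f"] - 32) * 5 / 9, 2)
    return None


print(rejected)                               # 1
print([read_celsius(line) for line in lake])  # [21.5, 21.11]
```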

What Data Engineers Do
• Building data systems and pipelines.
• The combination of extracting, loading, and transforming data is
accomplished by creating a data pipeline.
• Data pipeline → the design of systems for processing and storing data. These
systems capture, cleanse, transform and route data to destination systems.
Data engineers build data pipelines that enable the organization to collect
data points from millions of users and process the results in near real-time.



Figure: A pipeline that adds a location and modifies the date.
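
A pipeline of this kind, one that adds a location and modifies the date, might look like the sketch below. The original figure is not available, so the record fields, the branch-to-city lookup, and the date formats are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical lookup used to add a location; not taken from the slides.
CITY_BY_BRANCH = {"B01": "Amman", "B02": "Irbid"}


def add_location(record: dict) -> dict:
    """Step 1: add a location field based on the branch code."""
    return {**record, "location": CITY_BY_BRANCH.get(record["branch"], "unknown")}


def modify_date(record: dict) -> dict:
    """Step 2: rewrite the date from DD/MM/YYYY to ISO 8601 (YYYY-MM-DD)."""
    parsed = datetime.strptime(record["date"], "%d/%m/%Y")
    return {**record, "date": parsed.date().isoformat()}


def run_pipeline(records, steps):
    """Pass every record through each step in order."""
    for step in steps:
        records = [step(r) for r in records]
    return records


source = [{"branch": "B01", "date": "05/03/2023"}, {"branch": "B09", "date": "17/11/2022"}]
result = run_pipeline(source, [add_location, modify_date])
print(result)
```

Keeping each step as a small function makes it easy to reorder, test, or replace one stage without touching the others.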
Data Pipeline Optimization
• Each of the ETL steps can be optimized:
• Increasing the speed of data extraction.
• Transformations can be optimized to improve join speeds to other tables and
increase the performance of complex calculations or string manipulations.
• Data loads can be optimized by increasing the speed at which data can be stored in
the data store of choice or improving how the data is stored to be more effective for
the end user.
• These optimizations are specific to the technology and final use case.
• The optimization is NOT a one-time task but an ongoing process that requires
regular monitoring and tuning to maintain performance as the data and
workloads change.
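
As one hedged example of a load optimization, the sketch below contrasts committing every row separately with a single batched insert in one transaction. Batching is typically faster on many data stores, but the actual gain depends on the technology, so no timings are claimed here. The `users` table and helper names are hypothetical.

```python
import sqlite3

rows = [(i, f"user{i}") for i in range(1000)]


def fresh_db():
    """Create an empty in-memory database with the assumed target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    return conn


def load_row_by_row(conn, data):
    """Baseline: one statement and one commit per row."""
    for row in data:
        conn.execute("INSERT INTO users VALUES (?, ?)", row)
        conn.commit()


def load_batched(conn, data):
    """Optimized: one batched call inside a single transaction."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO users VALUES (?, ?)", data)


slow, fast = fresh_db(), fresh_db()
load_row_by_row(slow, rows)
load_batched(fast, rows)
count_slow = slow.execute("SELECT COUNT(*) FROM users").fetchone()[0]
count_fast = fast.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count_slow, count_fast)  # 1000 1000
```

Both loaders store the same data; only the number of round trips and commits differs, which is the part worth measuring on the real data store.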



What Data Engineers Do
• Evaluating business needs and objectives
• To make raw data useful to the organization, data engineers must understand
business objectives.
• Data engineers should understand business requirements and where data fits
into the business model so they can build a data ecosystem that serves the
organization’s needs.

• Building algorithms and prototypes
• Data pipelines represent an automated set of actions that extract data from various
sources for analysis and visualization. These processes are powered by algorithms.



What Data Engineers Do
• Interpreting trends and patterns
• Performing complex data analysis to find trends and patterns and reporting on the results in
the form of dashboards, reports, and data visualizations

• Preparing data for prescriptive and predictive modeling
• Data engineers must ensure the data is complete (no missing values), has been cleansed, and
that rules have been established for outliers (eliminate, ignore, average out, and so on)

• Developing analytical tools and programs
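
The data preparation point above (complete data, plus an agreed rule for outliers) can be sketched as follows. The outlier test used here, distance from the median measured in median absolute deviations, and the threshold of 5 are assumptions chosen for the example; the slides do not prescribe a method.

```python
import statistics

# Hypothetical sensor readings; None marks a missing value.
readings = [10.0, 11.0, None, 9.5, 10.5, 95.0, 10.2]


def prepare(values, rule="eliminate", threshold=5.0):
    """Drop missing values, then apply the agreed rule to outliers."""
    present = [v for v in values if v is not None]  # completeness
    median = statistics.median(present)
    mad = statistics.median([abs(v - median) for v in present])

    def is_outlier(v):
        # Assumed test: far from the median relative to the typical deviation.
        return mad > 0 and abs(v - median) / mad > threshold

    if rule == "eliminate":
        return [v for v in present if not is_outlier(v)]
    if rule == "average out":
        typical = statistics.mean([v for v in present if not is_outlier(v)])
        return [typical if is_outlier(v) else v for v in present]
    return present  # rule == "ignore": keep outliers as they are


print(prepare(readings))                      # [10.0, 11.0, 9.5, 10.5, 10.2]
print(prepare(readings, rule="average out"))
print(prepare(readings, rule="ignore"))
```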



Data Engineering Skills
Coding. Proficiency in coding languages is essential to this role,
so consider taking courses to learn and practice your skills.
Common languages include SQL, Python, Java, R, and Scala, along
with the query languages of NoSQL databases.

Data Storage. Selecting suitable data storage based on the
business needs. Data engineers should be familiar with relational
and non-relational databases, data warehouses, and data
lakes and how they work.



Data Engineering Skills
Data Pipeline. Data engineers should be able to build efficient
data pipelines and systems.

Machine learning. While machine learning is more the concern
of data scientists, it can be helpful to grasp the basic concepts
to better understand the needs of data scientists on the work
team.



Data is the Backbone of AI and ML
• As data scientists build a model,
they cycle through multiple iterations
of data ingestion and preparation.
• The cycle repeats as the data scientists
ask more questions and require extra
data for answers.
• Data engineers and data scientists
can turn the code written to build
the model into a small service that
can be asked to run some data
against the model or to
rebuild/enhance the model using
new data.



Data Engineering Skills
Big data tools. Data engineers are tasked with managing big
data. Tools and technologies vary by company, but some popular
ones include Hadoop, MongoDB, and Kafka.

Cloud computing. Data engineers must understand cloud
computing.

Data security. Data engineers and security teams are tasked with
securely managing and storing data to protect it from loss or
theft.
Data Engineering Tools
• Programming languages.
• SQL, Java, Python, R.

• Databases.
• Relational DB: MySQL, Oracle, PostgreSQL.
• NoSQL DB: MongoDB, Apache Cassandra, Elasticsearch.

• Data processing engines.
• Data processing engines allow the parallel execution of transformation tasks.
• Apache Spark.



Data Engineering Tools
• Data pipelines.
• Combining a transactional database, a programming language, a processing
engine, and a data warehouse results in a pipeline.
• Data pipelines need a scheduler to allow them to run at specified intervals.
• Apache Airflow & Apache NiFi → workflow management platforms.
• Apache NiFi is designed to handle data in motion and provide data integration.
• Apache Airflow is designed to handle data pipelines and workflow scheduling.
• Both can be used together to achieve a complete data integration and management solution.
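
To show what a workflow scheduler has to get right, the sketch below runs tasks in dependency order using only the Python standard library. It is not the Apache Airflow or Apache NiFi API; the task names and the `run_once` helper are hypothetical, and a real scheduler would also trigger such a run at specified intervals.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

log = []


# Hypothetical tasks; each one just records that it ran.
def extract():
    log.append("extract")


def transform():
    log.append("transform")


def load():
    log.append("load")


def report():
    log.append("report")


TASKS = {"extract": extract, "transform": transform, "load": load, "report": report}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_once():
    """Run every task once, respecting the dependency order."""
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        TASKS[name]()


run_once()
print(log)  # ['extract', 'transform', 'load', 'report']
```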



Any Questions?
www.psut.edu.jo
Call: (+962) 6-5359 949
Fax: (+962) 6-5347 295
Princess Sumaya University for Technology
Amman 11941 Jordan
P.o.Box 1438 Al-Jubaiha
