Lecture 13 - Data Warehousing
Advanced Databases
Lecture 12
Data Warehousing
Definition
Data Warehouse:
A subject-oriented, integrated, time-variant, non-
updatable collection of data used in support of
management decision-making processes
Subject-oriented: e.g. customers, patients,
students, products
Integrated: Consistent naming conventions,
formats, encoding structures; from multiple data
sources
Time-variant: Can study trends and changes
Non-updatable: Read-only, periodically
refreshed
Need for Data Warehousing
Integrated, company-wide view of high-quality
information (from disparate databases)
Separation of operational and informational systems
and data (for improved performance)
The ETL Process
Capture/Extract
Transform
The ETL Process
ETL (extract, transform, load) is a process used in data warehousing to extract data from
various sources, transform it into a format suitable for loading
into a data warehouse, and then load it into the warehouse.
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
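The difference between the two extract types can be sketched in a few lines of Python. This is a minimal illustration, not a production extractor: the source rows, the `updated_at` change-tracking column, and both function names are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
SOURCE_ROWS = [
    {"id": 1, "name": "Ann", "updated_at": datetime(2024, 1, 5)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2024, 2, 10)},
    {"id": 3, "name": "Cy", "updated_at": datetime(2024, 3, 1)},
]


def static_extract(rows):
    """Capture a full snapshot of the source as it is right now."""
    return [dict(row) for row in rows]


def incremental_extract(rows, last_extract_time):
    """Capture only the rows changed since the previous extract."""
    return [dict(row) for row in rows if row["updated_at"] > last_extract_time]


snapshot = static_extract(SOURCE_ROWS)
changes = incremental_extract(SOURCE_ROWS, datetime(2024, 2, 1))
```

Here `snapshot` holds every row, while `changes` holds only the rows modified after 1 February 2024. Real systems often detect changes through logs or triggers rather than a timestamp column.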
Extract Step
The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from
the source systems and storing it in a staging area.
Transform Step
In this stage, the extracted data is transformed into a format that is suitable for loading
into the data warehouse. This may involve cleaning and validating the data, converting
data types, combining data from multiple sources, and creating new data fields.
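As a rough sketch of the transform step, the function below cleans, validates, and converts one staged record and derives a new field. The record layout (`name`, `amount`, `order_date`) and the derived `order_year` field are hypothetical, chosen only to illustrate the operations named above.

```python
from datetime import date


def transform(record):
    """Clean, validate, and convert one staged record (hypothetical fields)."""
    name = record["name"].strip().title()      # clean whitespace and casing
    amount = float(record["amount"])           # convert text to a numeric type
    if amount < 0:                             # validate a simple business rule
        raise ValueError("amount must be non-negative")
    order_date = date.fromisoformat(record["order_date"])  # text to date
    return {
        "name": name,
        "amount": amount,
        "order_date": order_date,
        "order_year": order_date.year,         # new derived field
    }


staged = {"name": "  alice SMITH ", "amount": "19.50", "order_date": "2024-03-15"}
clean = transform(staged)
```

Rejecting bad records with an exception is one design choice; a pipeline could instead log them and continue.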
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Record-level:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization
Field-level:
single-field – from one field to one field
multi-field – from many fields to one, or one field to many
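The record-level and field-level functions above can be illustrated on small in-memory data. This is a sketch under assumptions: the `orders` and `customers` structures and all field names are invented for the example.

```python
from collections import defaultdict

# Hypothetical staged data.
orders = [
    {"order_id": 1, "cust_id": 10, "amount": 50.0},
    {"order_id": 2, "cust_id": 11, "amount": 20.0},
    {"order_id": 3, "cust_id": 10, "amount": 30.0},
]
customers = {
    10: {"first": "Ann", "last": "Lee"},
    11: {"first": "Bob", "last": "Ray"},
}

# Selection: partition the data by keeping rows that meet a condition.
large_orders = [o for o in orders if o["amount"] >= 30.0]

# Joining: combine each order with its matching customer record.
joined = [{**o, **customers[o["cust_id"]]} for o in orders]

# Aggregation: summarize order amounts per customer.
totals = defaultdict(float)
for o in orders:
    totals[o["cust_id"]] += o["amount"]

# Multi-field (many to one): build one field from two source fields.
full_names = [f'{row["first"]} {row["last"]}' for row in joined]
```

A single-field transformation would map one field to one field, for example converting `amount` from a string to a float.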
Load Step
After the data is transformed, it is
loaded into the data warehouse. This
step involves creating the physical data
structures and loading the data into the
warehouse.
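A minimal sketch of the load step, using Python's built-in `sqlite3` module with an in-memory database standing in for the warehouse. The table name `sales_fact`, its columns, and the sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical rows produced by the transform step.
transformed_rows = [
    (1, "Ann Lee", 50.0),
    (2, "Bob Ray", 20.0),
]

conn = sqlite3.connect(":memory:")  # in-memory database stands in for the warehouse

# Create the physical data structure.
conn.execute(
    "CREATE TABLE sales_fact ("
    "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Load the transformed rows in one batch.
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", transformed_rows)
conn.commit()

loaded_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

Checking the row count after loading is a simple way to confirm that the batch arrived intact.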
Data Mart
The data mart is a subset of
the data warehouse that is usually
oriented to a specific business line or
team.
Data marts are small slices of the data
warehouse. Whereas a data warehouse
has enterprise-wide scope, the
information in a data mart pertains to
a single department.
Independent data mart
Data marts: mini-warehouses, limited in scope
Single ETL for enterprise data warehouse (EDW)
Dependent data marts loaded from EDW
ODS and data warehouse are one and the same
Near real-time ETL for @active Data Warehouse
Data marts are NOT separate databases, but logical views of the data warehouse
Easier to create new data marts
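One way to picture a data mart as a logical view is a SQL view over a warehouse table. The sketch below uses Python's built-in `sqlite3` module; the `sales` table, its columns, and the `marketing_mart` view name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the data warehouse
conn.execute("CREATE TABLE sales (dept TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("marketing", "ads", 100.0),
        ("finance", "audit", 250.0),
        ("marketing", "events", 40.0),
    ],
)

# The "data mart" is only a view: no data is copied out of the warehouse.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT product, amount FROM sales WHERE dept = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT product, amount FROM marketing_mart ORDER BY product"
).fetchall()
```

Because the mart is a view definition rather than a separate database, adding another mart takes only one more `CREATE VIEW` statement.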
Data warehouse layers
Source layer
The logical layer of all systems of record
(SOR) that feed data into the warehouse. They
could include point-of-sale, marketing
automation, CRM, or ERP systems. Each
source SOR has a specific data format and
may require a different data capture method
based on that data format.
Data warehouse layers
Staging layer
A landing area for data from the source SOR.
A data staging best practice is to ingest data
from the SOR without applying business logic
or transformations. It’s also critical to ensure
that staging data is not used in production data
analysis; data in the staging area has yet to be
cleansed, standardized, modeled, governed,
and verified.
Data warehouse layers
Warehouse layer
The layer where all of the data is stored. The
warehouse data is now subject-oriented,
integrated, time-variant, and non-volatile. This
layer will have the physical schemas, tables,
views, stored procedures, and functions
needed to access the warehouse-modeled data.
Data warehouse layers
End-user layer
Also known as the analytics layer, this is where
you model data for consumption by data analysts,
data scientists, and business users through
analytics tools like ThoughtSpot.
Generic two-level architecture
One company-wide warehouse
On-Line Analytical Processing (OLAP)
OLAP Operations
Dice slicing – Dice. This allows an analyst
Slicing a data cube
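As a rough illustration of slicing, a small cube can be modeled as a dictionary keyed by (product, region, month). The dimensions, the values, and the function names are assumptions. "Dice" is taken here in its usual sense of restricting several dimensions to subsets of values.

```python
# Hypothetical sales cube: (product, region, month) -> units sold.
cube = {
    ("shoes", "east", "jan"): 10,
    ("shoes", "west", "jan"): 7,
    ("shoes", "east", "feb"): 12,
    ("hats", "east", "jan"): 4,
    ("hats", "west", "feb"): 6,
}


def slice_cube(cube, product):
    """Fix one dimension (product) and return the remaining 2-D table."""
    return {
        (region, month): units
        for (p, region, month), units in cube.items()
        if p == product
    }


def dice_cube(cube, products, regions):
    """Keep the sub-cube whose product and region fall in the given sets."""
    return {
        key: units
        for key, units in cube.items()
        if key[0] in products and key[1] in regions
    }


shoes_slice = slice_cube(cube, "shoes")
east_dice = dice_cube(cube, {"shoes", "hats"}, {"east"})
```

The slice drops the fixed dimension from the keys, while the dice keeps all three dimensions but with fewer cells.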
Summary report
Example:
Drill-down
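Drill-down can be sketched as re-running the same summary with one more dimension added. The sales rows, the `brand` and `package_size` dimensions, and the `summarize` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail rows.
sales = [
    {"brand": "A", "package_size": "small", "units": 5},
    {"brand": "A", "package_size": "large", "units": 3},
    {"brand": "B", "package_size": "small", "units": 2},
    {"brand": "A", "package_size": "small", "units": 4},
]


def summarize(rows, dims):
    """Total the units for each combination of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row["units"]
    return dict(totals)


summary = summarize(sales, ["brand"])                 # summary report
detail = summarize(sales, ["brand", "package_size"])  # drill-down adds a dimension
```

The detail totals for a brand always add up to that brand's line in the summary report.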