Lecture 13 - Data Warehousing

The document discusses data warehousing and the ETL process. It defines a data warehouse as a subject-oriented, integrated collection of data used for management decision making. The ETL process extracts data from sources, transforms it into a suitable format, and loads it into the data warehouse. This involves cleaning, transforming, and indexing the data. OLAP tools then allow users to analyze and get insights from the data through slicing, drilling down, and generating summary reports.

Uploaded by

Hassan Elbayya

S 325

Advanced Databases

Lecture 12
Data Warehousing

1
Definition
 Data Warehouse:
 A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
 Subject-oriented: e.g. customers, patients, students, products
 Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources
 Time-variant: Can study trends and changes
 Non-updatable: Read-only, periodically refreshed

2
Need for Data Warehousing
 Integrated, company-wide view of high-quality
information (from disparate databases)
 Separation of operational and informational systems
and data (for improved performance)

3
The ETL Process

 Capture/Extract

 Transform

 Load and Index

4
The ETL Process
 ETL is the process used in data warehousing to extract data from
various sources, transform it into a format suitable for loading
into a data warehouse, and then load it into the warehouse.
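The three steps above can be sketched as a tiny pipeline. This is a minimal illustration, not a real warehouse: the source and target are in-memory Python structures, and the field names are made up for the example.

```python
# Minimal ETL sketch: extract rows from a "source", transform them,
# and load them into a "warehouse" list. All names are illustrative.

def extract(source):
    # Capture a snapshot of the source data.
    return list(source)

def transform(rows):
    # Clean and reformat: drop rows missing a key field, normalize names.
    cleaned = []
    for row in rows:
        if row.get("customer_id") is None:
            continue  # error detection: skip incomplete records
        cleaned.append({
            "customer_id": row["customer_id"],
            "name": row.get("name", "").strip().title(),
        })
    return cleaned

def load(warehouse, rows):
    # Append transformed rows to the target store.
    warehouse.extend(rows)

source = [
    {"customer_id": 1, "name": "  alice smith "},
    {"customer_id": None, "name": "bad row"},
]
warehouse = []
load(warehouse, transform(extract(source)))
print(warehouse)  # [{'customer_id': 1, 'name': 'Alice Smith'}]
```

Real pipelines replace each function with connectors to databases, staging areas, and warehouse tables, but the extract → transform → load shape is the same.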

5
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
6
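The two extract styles can be contrasted in a few lines. Incremental extraction is commonly implemented by tracking a last-extract timestamp; the row structure and field names here are hypothetical.

```python
from datetime import datetime

# Toy source table with a last-modified timestamp per row.
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 3, 1)},
]

def static_extract(rows):
    # Snapshot: capture everything as of this point in time.
    return list(rows)

def incremental_extract(rows, last_extract):
    # Capture only rows changed since the previous extract.
    return [r for r in rows if r["updated_at"] > last_extract]

print(len(static_extract(rows)))                             # 2
print(len(incremental_extract(rows, datetime(2024, 2, 1))))  # 1
```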
Extract Step
The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from
the source systems and storing it in a staging area.

7
Transform Step
In this stage, the extracted data is transformed into a format that is suitable for loading
into the data warehouse. This may involve cleaning and validating the data, converting
data types, combining data from multiple sources, and creating new data fields.

It may involve the following processes/tasks:

1. Filtering – loading only certain attributes into the data warehouse.
2. Cleaning – filling NULL values with default values, mapping U.S.A, United States, and America to USA, etc.
3. Joining – joining multiple attributes into one.
4. Splitting – splitting a single attribute into multiple attributes.
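Each of the four tasks can be shown on a single record. This is a sketch; the field names and the country-mapping table are invented for the example.

```python
# A raw source record (illustrative field names).
record = {"first": "Ada", "last": "Lovelace", "country": "United States",
          "phone": None, "internal_flag": "x"}

# 1. Filtering: keep only the attributes the warehouse needs.
filtered = {k: record[k] for k in ("first", "last", "country", "phone")}

# 2. Cleaning: default NULL values; map country spellings to one code.
country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
cleaned = dict(filtered,
               phone=filtered["phone"] or "unknown",
               country=country_map.get(filtered["country"], filtered["country"]))

# 3. Joining: combine multiple attributes into one.
cleaned["full_name"] = f"{cleaned['first']} {cleaned['last']}"

# 4. Splitting: split a single attribute into multiple attributes.
cleaned["name_parts"] = cleaned["full_name"].split(" ")

print(cleaned["country"], cleaned["phone"], cleaned["full_name"])
# USA unknown Ada Lovelace
```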

8
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
9
Record-level:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization
Field-level:
single-field – from one field to one field
multi-field – from many fields to one, or one field to many
10
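Record-level selection and aggregation can be illustrated on a toy sales table (the rows and field names are made up for the example).

```python
from collections import defaultdict

sales = [
    {"region": "East", "amount": 100},
    {"region": "East", "amount": 50},
    {"region": "West", "amount": 70},
]

# Selection (record-level): partition out one region's records.
east = [r for r in sales if r["region"] == "East"]

# Aggregation (record-level): summarize amounts by region.
totals = defaultdict(int)
for r in sales:
    totals[r["region"]] += r["amount"]

print(len(east), dict(totals))  # 2 {'East': 150, 'West': 70}
```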
Load Step
 After the data is transformed, it is
loaded into the data warehouse. This
step involves creating the physical data
structures and loading the data into the
warehouse.

 The ETL process is an iterative process


that is repeated as new data is added to
the warehouse. The process is
important because it ensures that the
data in the data warehouse is accurate,
complete, and up-to-date. It also helps
to ensure that the data is in the format
required for data mining and reporting.

 Additionally, there are many different ETL


tools and technologies available, such as
Informatica, Talend, DataStage, and others,
that can automate and simplify the ETL
process.
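A minimal load step can be shown with SQLite standing in for the warehouse; the table and index names are illustrative, and a real warehouse load would use bulk utilities rather than row inserts.

```python
import sqlite3

# Create the physical data structure (a dimension table) and an index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_customer_name ON dim_customer(name)")

# Load the transformed rows into the warehouse table.
rows = [(1, "Alice Smith"), (2, "Bob Jones")]
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
print(count)  # 2
```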
11
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
12
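The two load modes can be contrasted in a few lines, with a plain dict standing in for the target table.

```python
def refresh(target, source):
    # Refresh mode: bulk rewrite of the target at periodic intervals.
    target.clear()
    target.update(source)

def update(target, changes):
    # Update mode: write only the rows that changed in the source.
    target.update(changes)

target = {1: "old", 2: "old"}
update(target, {2: "new"})
print(target)  # {1: 'old', 2: 'new'}

refresh(target, {3: "fresh"})
print(target)  # {3: 'fresh'}
```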
Data Warehouse Architectures
 Independent Data Mart
 Three-Layer architecture

13
Data Mart
 The data mart is a subset of
the data warehouse that is usually
oriented to a specific business line or
team.
 Data marts are small slices of the data
warehouse. Whereas data warehouses
have an enterprise-wide depth, the
information in data marts is attached to
a single department.
14
Independent data mart
Data marts: mini-warehouses, limited in scope
Separate ETL for each independent data mart
Data access complexity due to multiple data marts
15
Dependent data mart with operational data store
ODS provides option for obtaining current data
Single ETL for enterprise data warehouse (EDW)
Dependent data marts loaded from EDW
16
Logical data marts and active data warehouse
ODS and data warehouse are one and the same
Near real-time ETL for active data warehouse
Data marts are NOT separate databases, but logical views of the data warehouse
 Easier to create new data marts
17
Data warehouse layers
 Source layer
 The logical layer of all systems of record
(SOR) that feed data into the warehouse. They
could include point-of-sale, marketing
automation, CRM, or ERP systems. Each
source SOR has a specific data format and
may require a different data capture method
based on that data format.
18
Data warehouse layers
 Staging layer
 A landing area for data from the source SOR.
A data staging best practice is to ingest data
from the SOR without applying business logic
or transformations. It’s also critical to ensure
that staging data is not used in production data
analysis; data in the staging area has yet to be
cleansed, standardized, modeled, governed,
and verified.
19
Data warehouse layers
 Warehouse layer
 The layer where all of the data is stored. The
warehouse data is now subject-oriented,
integrated, time-variant, and non-volatile. This
layer will have the physical schemas, tables,
views, stored procedures, and functions
needed to access the warehouse-modeled data.

20
Data warehouse layers
 End-user layer
 Also known as the analytics layer, this is where
you model data for consumption by analytics
tools like ThoughtSpot and by data analysts,
data scientists, and business users.

21
Generic two-level architecture

One company-wide warehouse

Periodic extraction – data is not completely current in the warehouse

22
On-Line Analytical Processing (OLAP)
 The use of a set of graphical tools that provides
users with multidimensional views of their data
and allows them to analyze the data using
simple windowing techniques
 OLAP business intelligence queries often aid in
trends analysis, financial reporting, sales
forecasting, budgeting and other planning
purposes.

23
24
On-Line Analytical Processing (OLAP)
 OLAP Operations
 Dice – selecting data from multiple
dimensions to analyze, such as "sales of blue
beach balls in Iowa in 2017."
 Drill-down – going from summary to more
detailed views. This allows analysts to
navigate deeper among the dimensions of
data, for example drilling down from "time
period" to "years" and "months" to chart
sales growth for a product.
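Dicing and drilling down can be sketched with pandas on a toy sales cube; the dimension and measure names are invented for the example.

```python
import pandas as pd

# Toy sales cube: dimensions (year, state, product), measure (sales).
df = pd.DataFrame({
    "year":    [2017, 2017, 2018, 2018],
    "state":   ["Iowa", "Ohio", "Iowa", "Ohio"],
    "product": ["beach ball", "beach ball", "kite", "beach ball"],
    "sales":   [100, 80, 60, 90],
})

# Dice: select data along multiple dimensions at once.
dice = df[(df["state"] == "Iowa") & (df["year"] == 2017)]

# Drill-down: move from a coarse summary (per year)
# to a more detailed one (per year and state).
by_year = df.groupby("year")["sales"].sum()
by_year_state = df.groupby(["year", "state"])["sales"].sum()

print(dice["sales"].sum())                 # 100
print(by_year.loc[2017])                   # 180
print(by_year_state.loc[(2018, "Iowa")])   # 60
```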

25
Slicing a data cube

26
Summary report

Example:
Drill-down

Drill-down with color added

27
