Data warehousing
11
Outline
Part 1:
I. Introduction to Data Warehousing
II. Architecture of Data Warehousing
III. Design and Modeling in Data Warehousing
Part 2:
IV. ETL Processes in Data Warehousing
A. Extracting Data
1. Data Extraction Techniques
2. Data Profiling
B. Transforming Data
1. Data Cleaning and Quality
2. Data Integration
C. Loading Data 2
Data warehouse: Definition
A data warehouse is a centralized repository that integrates and stores large volumes
of structured, historical data from various sources within an organization.
It is designed for the purpose of supporting business intelligence (BI) activities,
including reporting, analysis, and decision-making processes.
Data warehouses provide a consolidated view of an organization's data, allowing
users to analyze trends, identify patterns, and gain valuable insights that can inform
strategic and operational decisions.
Data warehouses play a crucial role in business intelligence by providing decision-
makers with a unified and consistent view of historical data. 3
Keys characteristics
Key characteristics of a data warehouse include:
• Subject-Oriented: Data warehouses are organized around specific business
subjects or areas, such as sales, finance, or customer relations, to support
analytical queries and reporting within those domains.
• Integrated Data: Data from disparate sources, such as transactional databases,
spreadsheets, and external systems, is integrated and transformed to ensure
consistency and coherence in the warehouse. This integration process is often
facilitated through ETL (Extract, Transform, Load) procedures.
• Time-Variant: Data in a data warehouse is time-stamped, allowing users to
analyze trends and changes over time. This time-variant aspect enables historical
analysis and reporting.
4
Keys characteristics
• Non-Volatile: Unlike operational databases that are frequently updated with
transactional data, a data warehouse is non-volatile. Once data is loaded into
the warehouse, it is typically not updated or deleted, ensuring a stable
environment for analytical processing.
• Optimized for Query and Reporting: Data warehouses are structured and
indexed for efficient querying and reporting. They often use denormalized
schemas, such as star or snowflake schemas, to simplify and accelerate
analytical queries.
5
Data warehouse VS Database (1/3)
Data warehouse Database
Purpose Primarily designed for Designed for transactional
analytical processing and processing and day-to-day
business intelligence. It is operations. Focus is on efficient
optimized for complex data retrieval, insertion, and
queries and reporting. updating.
Data Types Stores large volumes of Stores operational data, often in
historical, structured data. real-time. Primarily contains
Often includes data from current and frequently updated
multiple sources within the information.
organization.
Schema Design Uses specialized schemas like Typically uses normalized
star schema or snowflake schemas to reduce redundancy
schema for efficient querying and maintain data integrity.
and reporting. Normalization helps in 6
transactional processing.
Data warehouse VS Database (2/3)
Data warehouse Database
Data Integration Involves the integration of data May store data from a specific
from various sources using ETL application or domain.
(Extract, Transform, Load) Integration is focused on
processes to ensure maintaining consistency within
consistency and coherence. the operational context.
Data Volatility Non-volatile; historical data is Volatile; data is frequently
stored and rarely updated. updated and modified as part
Changes typically involve of ongoing transactions.
adding new data rather than
modifying existing records.
7
Data warehouse VS Database (3/3)
Data warehouse Database
Query Optimization Optimized for complex Optimized for fast retrieval
queries. and updating of individual
records.
User Base Primarily used by analysts, Used by application
data scientists, and decision- developers, system
makers for in-depth analysis, administrators, and
reporting, and business operational staff for day-to-
intelligence activities. day application support and
transactional processing.
Data Processing Online Analytical Processing Online Transactional
(OLAP) Processing (OLTP)
8
OLTP VS OLAP
Data warehouses are tailored for analytical
processing, historical analysis, and business
intelligence, whereas databases are focused 9
on supporting transactional processing and
day-to-day operations.
Main Components of a Data
Warehouse
A data warehouse comprises several components that work together to facilitate the
storage, integration, and retrieval of large volumes of data for analytical processing.
The main components of a data warehouse include:
1. Data Sources:
These are systems or applications that generate and store data. Data sources can
include operational databases, external data feeds, spreadsheets, and other
repositories.
2. ETL (Extract, Transform, Load) Processes:
ETL processes are responsible for extracting data from various sources,
transforming it to conform to the data warehouse's structure and quality 10
standards, and loading it into the data warehouse.
Main Components of a Data
Warehouse
3. Data Warehouse Database:
The central repository that stores the integrated and transformed data. It is
optimized for analytical querying and reporting. Data warehouses often use
specialized database management systems (DBMS) designed for analytical
workloads.
4. Data Marts:
Data marts are subsets of the data warehouse that focus on specific business
functions or departments. They are often designed for the needs of a particular
group of users.
5. OLAP (Online Analytical Processing) Servers:
OLAP servers enable users to interactively analyze and explore data in a 11
multidimensional way. OLAP provides capabilities for slicing and dicing data,
drilling down into details, and performing complex analyses.
Design and Modeling in Data
Warehousing
Data warehouse modeling involves designing the structure and organization of
data within a data warehouse to facilitate efficient querying, reporting, and
analysis.
The goal is to provide a clear and optimized representation of data that supports
business intelligence and decision-making.
Dimensional modeling is much better suited for business intelligence (BI)
applications and data warehousing (DW)
The key concepts in dimensional modeling are facts, dimensions, and attributes. 12
All these concepts can be organized in several ways, called schemas.
Dimensional modeling overview
The fact Tbl_Fact_Store_Sales is at the core of the dimensional model
Four surrounding dimensions that define and put into context the store
sales:
• Tbl_Dim_Item, which is what products were sold.
• Tbl_Dim_Date, which is when those products were sold
• Tbl_Dim_Customer, who bought the products 13
• Tbl_Dim_Buyer, who bought the product for the store
Key concepts: Facts Tables
A fact is a measurement of a business activity, such as a business event or
transaction, and is generally numeric.
Examples of facts are sales, expenses, and inventory levels
Fact tables are composed of two types of columns: keys and measures
• The first, the key column, consists of a group of foreign keys (FK) that point
to the primary keys of dimensional tables that are associated with this fact
table to enable business analysis. The relationships between fact tables and
the dimensions are one-to-many.
• The second type of column is the actual measures of the business activity
such as the sales revenue and order quantity. Every measurement has a
grain, which is the level of detail in the measurement of an event such as a 14
unit of measure or currency used.
Facts Tables: Example
15
Fact table—primary key is a surrogate key. Fact table— several measures.
Key concepts : Dimension
A dimension is an entity that establishes the business context for the measures
(facts) used by an enterprise.
Dimensions define the who, what, where, and why of the dimensional model,
and group similar attributes into a category or subject area. Examples of
dimensions are product, geography, customers, employees, and time. Whereas
facts are numeric, dimensions are descriptive in nature (although some of those
descriptions, such as a product list price, may be numeric).
Creating a dimension enables facts to store attributes in a single place
16
Dimension
Dimensions keep the database from being overrun with redundant data. With all
the attributes in a dimension table, they don’t have to be repeated in the fact
tables.
Example:
Take Amazon, for example. The data for an individual sale will contain the product
identification number, but will not repeat all the attributes of the product (color,
description, reviews, etc.). Those attributes are in a dimension, and each individual
sale of that product just points to them.
From a business perspective, the key purpose of dimensions it to use their
17
attributes to filter and analyze data based on performance measures
Dimension
• Dimensions are used for
• Selection of data
• Grouping of data at the right level of detail
• Dimensions consist of dimension values
• Product dimension has values ”milk”, ”cream”, …
• Time dimension has values ”1/1/2001”, ”2/1/2001”,…
• Dimension values may have an ordering
• Used for comparing cube data across values
• Especially used for Time dimension
18
Dimension
• Dimensions have hierarchies with levels
• Typically 3-5 levels (of detail)
• Dimension values are organized in a tree structure
• Product: Product->Type->Category
• Store: Store->Area->City->County
• Time: Day->Month->Quarter->Year
• Dimensions have a bottom level and a top level
• Levels may have attributes
• Simple, non-hierarchical information
• Day has Workday as attribute
• Dimensions should contain much information
19
• Time dimension may contain holiday, season, events,…
• Good dimensions have 50-100 or more attributes/levels
Dimensional model: Example
Example: sales of supermarkets
• Facts and measures
• Each sales record is a fact, and its sales value is a measure
• Dimensions
• Group correlated attributes into the same dimension
• Each sales record is associated with its values of Product, store,
Time
20
Granularity: Dimensionality
Hierarchy
• Granularity of facts is important
• Level of detail
• Given by combination of bottom levels
• A dimensional hierarchy defines mappings from a set of lower-level
concepts to higher level concepts.
21
Data Warehouse Design
• A schema is a logical description of the entire database.
• Database uses relational model, while a data warehouse uses Star,
Snowflake, and Fact Constellation schema.
22
Star Schema
In a star schema, there is a central fact table surrounded by dimension
tables.
Each dimension in a star schema is represented with only one-dimension
table
The fact table contains numerical measures (such as sales or revenue), and
dimension tables provide descriptive information about the measures .
This dimension table contains the set of attributes.
23
Star Schema: Example
24
Snowflake schema
Snowflake schema is an expanded version of a star schema in which
dimension tables are normalized into several related tables.
• Advantages
• Small saving in storage space
• Normalized structures are easier to update and maintain
• Disadvantages
• A schema that is less intuitive
• The ability to browse through the content is difficult
25
• A degraded query performance because of additional joins.
Snowflake schema: Example
26
Fact constellation
schema
• A fact constellation has multiple fact tables. It is also known as galaxy
schema.
• The following diagram shows two fact tables, namely sales and Inventory
27
From the Data Warehouse to
Data Marts
• A data mart contains only those data that are specific to a particular
group. For example, the marketing data mart may contain only data
related to items, customers, and sales.
• Data marts are confined to subjects.
• Data marts are small in size.
• Data marts are customized by department
28
The complete Decision Support
System
29
DWH Architecture
30
Types of Data Warehousing
Architectures
1. Centralized Data Warehouse : is a single, unified repository that stores
and manages data from various sources within an organization. It serves
as a centralized and integrated platform for business intelligence and
decision-making.
2. Data Marts : are smaller, specialized subsets of a data warehouse that
focus on specific business areas, departments, or user groups. They are
designed to meet the needs of a particular set of users with common
31
interests.
Types of Data Warehousing
Architectures
3. Federated Data Warehouse : is an architecture that integrates data
from multiple independent data sources without physically
consolidating the data into a central repository. It enables distributed
data access and processing.
4. Hybrid Data Warehouse: combines elements of both centralized and
distributed architectures. It may involve a mix of on-premises and
cloud-based solutions, as well as a combination of centralized and
32
federated approaches.
Extraction Transformation
Loading–ETL tools
33
Data architecture VS Data
modeling
• Data architecture applies to the higher-level view of how the enterprise
handles its data, such as how it is categorized, integrated, and stored.
• Data modeling applies to very specific and detailed rules about how pieces
of data are arranged in the database. Where data architecture is the
blueprint for your house, data modeling is the instructions for installing a
faucet.
34
Kimball Approach:
• Kimball emphasizes the use of dimensional modeling, creating star or
snowflake schemas. This approach focuses on designing the data
warehouse based on business processes and user requirements.
• Follows a bottom-up development approach, starting with the creation
of data marts that address immediate business requirements. These
data marts are then integrated to form the complete data warehouse.
• Kimball's approach involves the use of Extract, Transform, Load (ETL)
processes that are specifically designed for dimensional models. This
35
ensures the transformation of source data into a format optimized for
reporting and analysis.
Kimball Approach:
36
Inmon's Approach:
Inmon supporters the creation of a centralized Enterprise Data Warehouse
(EDW) as the foundation. This EDW serves as a single, integrated repository for
the entire organization.
Inmon's approach follows a top-down development methodology. It begins
with the creation of an enterprise-wide data warehouse and then focuses on
building data marts to meet specific business needs.
37
Kimball VS Inmon’s Approach
Philosophy:
Kimball: Business-driven, iterative, and agile.
Inmon: Enterprise-centric, normalized, and long-term.
Data Model:
Kimball: Dimensional modeling, star or snowflake schemas.
Inmon: Normalized data model, 3NF.
Development Approach:
Kimball: Bottom-up development, starting with data marts.
Inmon: Top-down development, starting with the enterprise data warehouse.
Data Marts:
Kimball: Considers data marts as primary deliverables.
Inmon: Views data marts as subsets of the enterprise data warehouse.
Flexibility: 38
Kimball: Agile and adaptable to changing business needs.
Inmon: Emphasizes a stable and scalable architecture for long-term use.
Kimball approach: Main
steps
1. Choose the subject : Clearly define the business objectives and scope of
the data warehouse project.
2. Requirements Gathering: Collaborate closely with business users to
gather their reporting and analysis requirements.
3. Dimensional Modeling: Star or Snowflake Schema: Develop dimensional
models using star or snowflake schemas. Identify Dimensions and Facts
4. ETL Design and Development: Create Extract, Transform, Load (ETL)
processes based on dimensional models.
5. Data Mart Development: Develop data marts as subsets of the data
warehouse, addressing specific business needs.
39
6. Business Intelligence Tools Integration: Choose and integrate business
intelligence tools compatible with dimensional models.