Data Warehousing Unit 1
In essence, a data warehouse is not just another database; it's a strategically designed
system that transforms raw operational data into meaningful information for business
users.
Organizations develop data warehouses to address several critical business needs and to overcome
the limitations of relying solely on operational systems for analytical purposes. The primary
needs include integrating data from disparate operational systems into a single, consistent view;
preserving historical data for trend analysis; offloading heavy analytical queries from
transaction-processing systems; and supplying reliable, high-quality data for decision-making and
business intelligence.
In conclusion, the development of a data warehouse is driven by the need for organizations to
transform their raw data into actionable insights, enabling better decision-making, enhancing
business intelligence, gaining a competitive edge, and improving overall business performance.
It provides a foundation for effective analysis and a deeper understanding of the organization's
past, present, and potential future.
The architecture of a data warehouse system typically includes the following components:
1. Source Systems: These are the operational systems and external data sources that
provide the raw data for the data warehouse. Examples include:
o Transactional databases (OLTP systems)
o Customer Relationship Management (CRM) systems
o Enterprise Resource Planning (ERP) systems
o Flat files (e.g., CSV, Excel)
o Web logs
o External data feeds
2. ETL (Extraction, Transformation, and Loading) Process: This is the core process
responsible for moving data from the source systems to the data warehouse (a minimal
Python sketch of the full flow appears after this component list). It involves three main
stages:
o Extraction: Retrieving data from the various source systems. This may involve
selecting relevant data, handling different data formats, and dealing with data
access methods.
o Transformation: Cleaning, integrating, and transforming the extracted data into
a consistent and usable format for the data warehouse. This includes tasks like:
▪ Data cleaning (handling missing values, correcting errors, resolving
inconsistencies)
▪ Data integration (combining data from different sources, resolving naming
conflicts)
▪ Data transformation (converting data types, standardizing units, creating
derived attributes)
o Loading: Writing the transformed data into the data warehouse. This often
involves initial loading of historical data and subsequent periodic updates.
3. Data Warehouse: This is the central repository where the integrated and cleaned data is
stored. It is typically a relational database management system (RDBMS) optimized for
analytical queries. Common data warehouse database systems include:
o Teradata
o Snowflake
o Amazon Redshift
o Google BigQuery
o Microsoft Azure Synapse Analytics
4. Metadata Repository: This component stores "data about data." It contains information
about the data warehouse structure, data sources, ETL processes, data transformations,
data quality rules, security access, and user information. Metadata is crucial for
understanding, managing, and using the data warehouse effectively. Types of metadata
include:
o Technical Metadata: Describes the data warehouse structure (schemas, tables,
columns, data types), ETL processes, and data source information.
o Business Metadata: Provides business context to the data in the warehouse,
including business rules, data definitions, and user-friendly descriptions.
5. Data Warehouse Access Tools: These are the tools used by business users to interact
with the data in the warehouse for analysis, reporting, and decision-making. Common
types of access tools include:
o Query and Reporting Tools: Allow users to formulate SQL queries or use
graphical interfaces to generate reports and perform ad-hoc analysis.
o OLAP (Online Analytical Processing) Tools: Enable multi-dimensional
analysis of data, allowing users to slice and dice data to gain insights from
different perspectives.
o Data Mining Tools: Used to discover hidden patterns, trends, and relationships in
the data.
o Business Intelligence (BI) Platforms: Integrated suites of tools that provide a
comprehensive environment for data analysis, visualization, and reporting (e.g.,
Tableau, Power BI, Qlik Sense).
6. Data Marts (Optional): These are subject-oriented subsets of the data warehouse
designed to meet the specific analytical needs of particular business units or user groups
(e.g., marketing data mart, sales data mart). Data marts can improve query performance
and make data access easier for specific teams. They can be:
o Dependent Data Marts: Created directly from the central data warehouse.
o Independent Data Marts: Developed independently of the central data
warehouse, often from source systems. This approach can lead to data
inconsistencies and is generally less preferred.
7. Data Governance Framework: This encompasses the policies, procedures, and
responsibilities for ensuring the quality, security, and integrity of the data in the data
warehouse. It addresses aspects like data ownership, data standards, data access control,
and data auditing.
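To make component 2 (the ETL process) concrete, below is a minimal, hypothetical sketch in
Python. It extracts rows from a CSV export, applies simple cleaning and standardization, and
loads the result into a SQLite table standing in for the warehouse; the file name, column names,
and cleaning rules are illustrative assumptions rather than part of any particular system.

import csv
import sqlite3

# Extraction: read raw rows from a (hypothetical) operational export
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transformation: clean, standardize, and convert the extracted rows
def transform(rows):
    cleaned = []
    for row in rows:
        # Data cleaning: skip rows with a missing sale amount
        if not row.get("sale_amount"):
            continue
        cleaned.append({
            "sale_date": row["sale_date"].strip(),            # trim stray whitespace
            "product": row["product_name"].strip().title(),   # resolve naming inconsistencies
            "quantity": int(row["quantity"]),                  # convert data types
            "sale_amount": round(float(row["sale_amount"]), 2),
        })
    return cleaned

# Loading: write the transformed rows into the warehouse table
def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS sales_fact (
                       sale_date TEXT, product TEXT,
                       quantity INTEGER, sale_amount REAL)""")
    con.executemany(
        "INSERT INTO sales_fact VALUES (:sale_date, :product, :quantity, :sale_amount)",
        rows,
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("daily_sales_export.csv")))

A production pipeline would add incremental extraction, error handling, scheduling, and audit
logging, but the three functions map directly onto the Extraction, Transformation, and Loading
stages described above.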
The design of a data warehouse is an iterative process, and the steps may overlap. A well-
designed data warehouse is crucial for providing timely, accurate, and insightful information that
supports effective business decision-making.
Measures (Facts)
Definition: Measures, often referred to as facts, are the numerical values that represent
the business events or transactions you want to analyze. They are typically quantitative
and can be aggregated (summed, averaged, counted, etc.).
Characteristics:
o Numeric: Measures are always numeric.
o Aggregatable: The primary purpose of measures is to be aggregated across
different dimensions to gain insights.
o Quantitative: They represent quantities, amounts, or counts.
o Stored in Fact Tables: Measures reside in the fact tables of a dimensional model.
Examples:
o Sales Amount: The total revenue generated from a sale.
o Quantity Sold: The number of units of a product sold.
o Profit: The difference between revenue and cost.
o Number of Orders: The count of individual orders placed.
o Inventory Level: The current stock of a product.
o Website Visits: The number of times a webpage was accessed.
o Temperature: A recorded temperature reading.
Dimensions
Definition: Dimensions provide the context for the measures. They are descriptive
attributes that help you to categorize, group, and filter the measures for analysis.
Dimensions answer the questions of "who," "what," "where," "when," and "why" about
the business events.
Characteristics:
o Descriptive: Dimensions contain textual or categorical data that describes the
business context.
o Qualitative: They represent qualities or characteristics.
o Used for Filtering and Grouping: Dimensions are used to slice and dice the
measures in reports and analytical queries.
o Stored in Dimension Tables: Dimension attributes are stored in dimension
tables, which are typically linked to the fact table through foreign keys.
Examples of Common Dimensions:
o Time Dimension: Date, Month, Year, Quarter, Day of Week.
o Product Dimension: Product Name, Product Category, Brand, Color, Size.
o Customer Dimension: Customer Name, Customer ID, Age Group, Gender,
Location.
o Location Dimension: City, State, Country, Region.
o Organization Dimension: Department, Business Unit.
o Salesperson Dimension: Salesperson Name, Sales Team, Sales Region.
Fact Tables: Contain measures and foreign keys that reference the primary keys of the
dimension tables. Each row in a fact table represents a specific business event at the
intersection of the associated dimension members.
Dimension Tables: Contain the descriptive attributes of each dimension. Each row in a
dimension table represents a unique member of that dimension.
Analysis: By joining fact tables with dimension tables, you can analyze the measures
across different dimensional attributes, as sketched in the example below. For example:
o "What was the total sales amount (measure) for each product category
(dimension) in each month (dimension)?"
o "How many customers (measure: a count of distinct customer IDs) are in each
age group (dimension) and each region (dimension)?"
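This kind of question maps directly onto a join-then-aggregate pattern. The sketch below uses
pandas with tiny, made-up dimension and fact tables (all names and values are hypothetical) to
answer the first question above.

import pandas as pd

# Dimension tables: one row per dimension member, descriptive attributes only
product_dim = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Pen", "Notebook", "Monitor"],
    "category": ["Stationery", "Stationery", "Electronics"],
})
time_dim = pd.DataFrame({
    "time_id": [101, 102],
    "month": ["2024-01", "2024-02"],
})

# Fact table: measures plus foreign keys into the dimensions
sales_fact = pd.DataFrame({
    "product_id": [1, 2, 3, 3],
    "time_id":    [101, 101, 101, 102],
    "sales_amount": [5.0, 12.0, 950.0, 980.0],
})

# Join facts to dimensions, then aggregate the measure by dimensional attributes
result = (
    sales_fact
    .merge(product_dim, on="product_id")
    .merge(time_dim, on="time_id")
    .groupby(["category", "month"])["sales_amount"]
    .sum()
)
print(result)  # total sales amount per product category per month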
Understanding the distinction and relationship between dimensions and measures is crucial for
designing effective data warehouses and creating meaningful business intelligence reports and
analyses. They form the foundation of dimensional modeling, which is widely used for
organizing data for analytical purposes.
Dependent Data Marts
Definition: A dependent data mart is created directly from an existing enterprise data
warehouse. It relies on the central data warehouse as its single source of truth.
Characteristics:
o Single Source of Data: Data is sourced exclusively from the enterprise data
warehouse.
o Consistency: Ensures data consistency across different data marts as they all
draw from the same integrated and cleaned data.
o Reduced Data Redundancy: Data is stored centrally, minimizing redundancy in
the data mart.
o Simplified ETL: The ETL process is primarily focused on building and
maintaining the central data warehouse. Data marts require a simpler extraction
and transformation process from the warehouse.
o Easier Integration: Integration between different dependent data marts is
straightforward as they share a common foundation.
o Scalability: The scalability of dependent data marts is tied to the scalability of the
central data warehouse.
Architecture:
Source Systems
      |
      V
Enterprise Data Warehouse
      |
      +----------------------+
      |                      |
      V                      V
Data Mart A            Data Mart B
(e.g., Sales)          (e.g., Marketing)
Advantages:
o Data Consistency: The most significant advantage is data consistency across the
organization.
o Reduced Development Time: Building dependent data marts is generally faster
as the core data integration and cleaning are already done in the central
warehouse.
o Simplified Management: Easier to manage and maintain as the data origin is
centralized.
o Improved Data Governance: Benefits from the data governance policies
established for the enterprise data warehouse.
Disadvantages:
o Reliance on Central Warehouse: The availability and performance of dependent
data marts are tied to the central data warehouse. Any issues with the central
warehouse can impact the data marts.
o Potential Bottleneck: If the central data warehouse is not well-designed or
performs poorly, it can become a bottleneck for accessing data in the data marts.
o Less Flexibility: Less flexibility in terms of data modeling and integration
specific to the needs of a particular business unit if the central warehouse design
doesn't fully accommodate them.
Independent Data Marts
Definition: An independent data mart is built directly from source systems for a specific
business unit or user group, without an enterprise data warehouse as an intermediary.
Advantages:
o Faster Implementation (Initially): Can be implemented quickly for specific
needs without waiting for the development of a large enterprise data warehouse.
o Tailored to Specific Needs: Can be designed and optimized to meet the exact
analytical requirements of a particular business unit.
o Flexibility: Offers greater flexibility in terms of data modeling and source system
integration for a specific area.
Disadvantages:
o Data Inconsistency: The biggest drawback is the potential for inconsistent data
across different data marts and the rest of the organization.
o Data Redundancy: Increased data redundancy leads to higher storage costs and
potential inconsistencies.
o Complex Management: Managing multiple independent ETL processes and data
marts can be complex and resource-intensive.
o Difficult Integration: Integrating data across different independent data marts for
a holistic view of the business is challenging.
o Lack of Enterprise View: Does not provide a unified view of the organization's
data.
Distributed Data Marts
Definition: A distributed data mart architecture involves having multiple data marts that
are interconnected and can share data. This approach attempts to combine some of the
benefits of both dependent and independent data marts.
Characteristics:
o Interconnected: Data marts can communicate and exchange data.
o Subject-Oriented: Each data mart focuses on a specific business area.
o Potential for Controlled Redundancy: Some data might be replicated across
data marts for performance or accessibility reasons, but ideally, it's managed and
controlled.
o Complex Architecture: Requires careful design and management to ensure data
consistency and efficient data sharing.
o Federated Approach: Can be seen as a federated approach where each data mart
is autonomous but participates in a larger data sharing framework.
Architecture (Conceptual):
Source Systems
      |
      V
(Optional) Staging Area
      |
      +------------------+------------------+
      |                  |                  |
      V                  V                  V
Data Mart A <------> Data Mart B <------> Data Mart C
(e.g., Sales)      (e.g., Marketing)    (e.g., Finance)
      ^                  ^                  ^
      |                  |                  |
      (Potentially shared/derived data)
Advantages:
o Flexibility and Scalability: Allows for flexibility in designing data marts for
specific needs while still enabling data sharing and integration.
o Improved Performance: Data marts can be optimized for specific user groups,
potentially improving query performance.
o Hybrid Approach: Can leverage existing independent data marts while gradually
integrating them or building new dependent data marts.
Disadvantages:
o Complexity: Designing and managing a distributed data mart environment is
more complex than either a purely dependent or independent approach.
o Data Consistency Challenges: Ensuring data consistency across interconnected
data marts requires careful planning and implementation of data sharing and
synchronization mechanisms.
o Governance Issues: Establishing and enforcing data governance policies across
distributed data marts can be challenging.
In Summary:
Dependent Data Marts: Best when data consistency and a single version of the truth are
paramount, and a robust enterprise data warehouse exists.
Independent Data Marts: May be considered for very specific, isolated needs or as a
starting point before building a full data warehouse, but they often lead to data
inconsistencies and management challenges in the long run.
Distributed Data Marts: Offer a compromise, providing flexibility while attempting to
maintain some level of integration and data sharing, but they require careful design and
management to avoid complexity and consistency issues.
The choice of data mart architecture depends on the organization's specific needs, existing
infrastructure, data governance policies, and long-term data warehousing strategy. In most
modern data warehousing environments, a dependent data mart approach built upon a well-
designed enterprise data warehouse is generally preferred for its benefits in terms of data
consistency and governance.
Conceptual Modeling of Data Warehouses
Conceptual modeling is the initial high-level stage in data warehouse design. It focuses on
understanding and representing the business requirements and the key business entities and their
relationships, without delving into the technical details of database implementation. The goal is
to create a business-oriented model that serves as a blueprint for the subsequent logical and
physical design phases.
Key activities in conceptual modeling include:
Identifying Business Processes: Understanding the core business activities that need to
be analyzed (e.g., sales, orders, shipments, customer interactions).
Identifying Key Business Entities: Determining the central subjects of analysis related
to these processes (e.g., Customers, Products, Time, Locations, Salespeople).
Defining Relationships: Understanding how these entities interact within the business
processes.
Identifying Key Measures: Determining the quantitative data points that need to be
tracked and analyzed (e.g., Sales Amount, Quantity Sold, Profit, Number of Visits).
Defining Dimensions: Identifying the contextual attributes that provide different
perspectives for analyzing the measures (e.g., Product Category, Customer Segment,
Sales Region, Month).
1. Star Schema
Structure: The star schema is characterized by a central fact table surrounded by several
dimension tables, resembling a star.
Fact Table: Contains the primary measures (quantitative data) and foreign keys that
reference the primary keys of the dimension tables. It represents the relationships
between the dimensions for a specific business event.
Dimension Tables: Contain the descriptive attributes that provide context to the facts.
They are typically denormalized, meaning they may contain redundant data to simplify
querying and improve performance. Each dimension table has a primary key that is
referenced by one or more foreign keys in the fact table.
Simplicity: The star schema is relatively simple to understand and query, making it
popular for OLAP applications.
Query Performance: The denormalized dimension tables often lead to fewer joins in
queries, which can improve performance.
Example:
                         Time Dimension
              (Time_ID, Year, Quarter, Month, Day)
                              ^
                              | FK
                     --------------------
                     |    Sales Fact    |
                     |------------------|
                     | Sales_ID (PK)    |
Product Dimension <--| Product_ID (FK)  |
(Product_ID,         | Customer_ID (FK) |--> Customer Dimension
 Product_Name,       | Time_ID (FK)     |    (Customer_ID, Customer_Name,
 Category, Brand)    | Location_ID (FK) |     Age_Group, Gender)
                     | Sales_Amount     |
                     | Quantity_Sold    |
                     | ...              |
                     --------------------
                              |
                              | FK
                              V
                      Location Dimension
              (Location_ID, City, State, Country)
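The diagram above translates directly into table definitions. The following sketch creates the
star schema in an in-memory SQLite database from Python; the data types and constraints are
illustrative assumptions, and a real warehouse would use its own platform's DDL.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Denormalized dimension tables, one per analytical perspective
CREATE TABLE product_dim  (product_id  INTEGER PRIMARY KEY, product_name TEXT, category TEXT, brand TEXT);
CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, customer_name TEXT, age_group TEXT, gender TEXT);
CREATE TABLE time_dim     (time_id     INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER, month INTEGER, day INTEGER);
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);

-- Central fact table: measures plus a foreign key to every dimension
CREATE TABLE sales_fact (
    sales_id      INTEGER PRIMARY KEY,
    product_id    INTEGER REFERENCES product_dim(product_id),
    customer_id   INTEGER REFERENCES customer_dim(customer_id),
    time_id       INTEGER REFERENCES time_dim(time_id),
    location_id   INTEGER REFERENCES location_dim(location_id),
    sales_amount  REAL,
    quantity_sold INTEGER
);
""")

A typical star-schema query then reads the fact table once and joins only to the dimensions a
given report needs, which is what keeps queries simple and fast.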
2. Snowflake Schema
Structure: The snowflake schema is an extension of the star schema where the
dimension tables are normalized. This means that the dimension tables are further
broken down into multiple related tables to eliminate redundancy.
Normalization: Dimension attributes are organized into hierarchical structures, creating
a snowflake-like shape when diagrammed. For example, a Location dimension might be
split into Location, City, State, and Country tables.
Reduced Data Redundancy: Normalization reduces data redundancy and can improve
data integrity.
Increased Query Complexity: Querying a snowflake schema often requires more joins
compared to a star schema, which can potentially impact query performance.
Data Integrity: The normalized structure can enforce data integrity more effectively.
Example (the Location dimension normalized into a hierarchy):
Sales Fact
(Location_ID FK, ...)
      |
      | FK
      V
Location Dimension
(Location_ID, City_ID)
      |
      | FK
      V
City Dimension
(City_ID, City_Name, State_ID)
      |
      | FK
      V
State Dimension
(State_ID, State_Name, Country_ID)
      |
      | FK
      V
Country Dimension
(Country_ID, Country_Name)
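To make the extra join cost concrete, the sketch below (SQLite is again used only for
illustration, with hypothetical table names matching the diagram) answers "total sales by
country" by walking the Location -> City -> State -> Country chain; against the star schema
above, the same question would need only one join to a denormalized Location dimension.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE country_dim  (country_id  INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE state_dim    (state_id    INTEGER PRIMARY KEY, state_name TEXT, country_id INTEGER REFERENCES country_dim(country_id));
CREATE TABLE city_dim     (city_id     INTEGER PRIMARY KEY, city_name TEXT, state_id INTEGER REFERENCES state_dim(state_id));
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, city_id INTEGER REFERENCES city_dim(city_id));
CREATE TABLE sales_fact   (location_id INTEGER REFERENCES location_dim(location_id), sales_amount REAL);
""")

# Four joins are needed to climb the normalized Location hierarchy,
# versus a single join against a denormalized star-schema Location dimension.
rows = con.execute("""
    SELECT co.country_name, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN location_dim l  ON f.location_id = l.location_id
    JOIN city_dim     c  ON l.city_id     = c.city_id
    JOIN state_dim    s  ON c.state_id    = s.state_id
    JOIN country_dim  co ON s.country_id  = co.country_id
    GROUP BY co.country_name
""").fetchall()
print(rows)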
3. Fact Constellation (Galaxy) Schema
Structure: A fact constellation, also known as a galaxy schema, consists of multiple fact
tables that share some common dimension tables. This model is used when there are
multiple business processes with shared dimensions.
Multiple Fact Tables: Each fact table represents a different business process (e.g., sales,
shipping, billing).
Shared Dimensions: Dimension tables like Time, Product, and Customer can be shared
across these multiple fact tables, ensuring consistency in the dimensional data.
Complexity: Fact constellations can be more complex to design and manage than single
fact table schemas.
Comprehensive View: This model allows for a more comprehensive view of the
business by integrating data from different but related business processes.
Example:
                          Time Dimension
               (Time_ID, Year, Quarter, Month, Day)
                     ^                     ^
                     |                     |
         --------------------    --------------------
         |    Sales Fact    |    |  Shipping Fact   |
         |------------------|    |------------------|
         | Sales_ID (PK)    |    | Shipment_ID (PK) |
         | Product_ID (FK)  |    | Product_ID (FK)  |
         | Customer_ID (FK) |    | Customer_ID (FK) |
         | Time_ID (FK)     |    | Time_ID (FK)     |
         | Location_ID (FK) |    | Location_ID (FK) |
         | Sales_Amount     |    | Shipment_Date    |
         | Quantity_Sold    |    | Delivery_Date    |
         | ...              |    | Shipping_Cost    |
         --------------------    --------------------
               |  |  |                 |  |  |
               V  V  V                 V  V  V
      Product Dimension  (Product_ID, Product_Name, Category, Brand)
      Customer Dimension (Customer_ID, Customer_Name, Age_Group, Gender)
      Location Dimension (Location_ID, City, State, Country)

(Both fact tables share the Time, Product, Customer, and Location dimensions.)
The choice between star, snowflake, and fact constellation schemas depends on various factors,
including query performance and simplicity requirements, the importance of reducing redundancy
and enforcing data integrity, and whether a single business process or several related processes
must be modeled.
In practice, a star schema or a variation of it is often the preferred choice for its balance of
simplicity and performance. Snowflake schemas might be used for specific dimensions where
data integrity and reduced redundancy are critical, even if it slightly increases query complexity.
Fact constellations are employed when modeling multiple, interconnected business processes
that share common dimensions.
The Multidimensional Data Model and Aggregates
Key Concepts:
Dimensions: These are the categories that define the perspectives for analyzing the data.
They represent the "who," "what," "where," and "when" of the data. Examples include:
o Time (Year, Quarter, Month, Day)
o Product (Category, Sub-Category, Product Name)
o Location (Country, State, City)
o Customer (Segment, Region, Individual Customer)
Measures (Facts): These are the numerical values that you want to analyze. They
represent the "how much" or "how many." Measures are typically quantitative and can be
aggregated. Examples include:
o Sales Amount
o Quantity Sold
o Profit
o Number of Orders
o Website Visits
Cells: Each cell in the multidimensional cube represents a specific combination of
dimension members and contains the value of the measure(s) for that combination.
Hierarchies: Dimensions often have hierarchical relationships. For example, the Time
dimension can have a hierarchy of Year -> Quarter -> Month -> Day. These hierarchies
allow for drilling down (going from a higher level to a lower level of detail) and rolling
up (aggregating data from a lower level to a higher level).
Levels: Each level in a hierarchy represents a different granularity of the dimension (e.g.,
Year level, Month level).
Attributes: Dimensions can have attributes that provide further descriptive information
about the dimension members (e.g., Product Name is an attribute of the Product
dimension).
The multidimensional model supports various OLAP operations that enable users to analyze data
from different angles, including roll-up (aggregating to a higher level of a hierarchy),
drill-down (moving to a finer level of detail), slice (fixing one dimension member), dice
(selecting a sub-cube across several dimensions), and pivot (rotating the axes of analysis). A
small illustration follows.
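Below is a minimal sketch of roll-up, drill-down, and slice using pandas on a tiny, hypothetical
data set; the Year -> Month hierarchy and the category values are assumptions made only for the
example.

import pandas as pd

# Detail-level data: one row per (year, month, product category) with a measure
sales = pd.DataFrame({
    "year":     [2024, 2024, 2024, 2024],
    "month":    ["Jan", "Jan", "Feb", "Feb"],
    "category": ["Stationery", "Electronics", "Stationery", "Electronics"],
    "sales_amount": [17.0, 950.0, 20.0, 980.0],
})

# Roll-up: aggregate from the Month level up to the Year level
by_year = sales.groupby("year")["sales_amount"].sum()

# Drill-down: return to the finer (Year, Month) grain, split by category
by_year_month = sales.pivot_table(index=["year", "month"],
                                  columns="category",
                                  values="sales_amount",
                                  aggfunc="sum")

# Slice: fix one dimension member (month == "Jan") and analyze the rest
jan_slice = sales[sales["month"] == "Jan"]

print(by_year, by_year_month, jan_slice, sep="\n\n")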
Types of Aggregates:
Aggregates can be created at different levels of granularity for various dimensions and
combinations of dimensions, for example total sales by month rather than by individual day, or
sales by product category rather than by individual product.
Design and Maintenance Considerations:
Identify Common Queries: Analyze query patterns and user needs to determine which
aggregations will provide the most significant performance benefits.
Balance Performance and Storage: Creating too many aggregates can improve
performance but also significantly increase storage space and the time required for ETL
processes to maintain them. A balance needs to be struck based on query frequency and
storage capacity.
Aggregate Navigation: The data warehouse system should have a mechanism (often
called an aggregate navigator or query rewrite engine) to automatically identify the most
appropriate aggregate table to use when a query is submitted, making the use of
aggregates transparent to the end-user.
Maintenance: Aggregates need to be updated whenever new data is loaded into the base
fact tables, which adds to the complexity and processing time of the ETL process.
Implementation of Aggregates:
Materialized Views: Many database systems support materialized views, which are pre-
computed and stored results of SQL queries. These can be used to create and manage
aggregates.
Summary Tables: Separate tables can be explicitly created in the data warehouse to
store aggregated data.
OLAP Cubes: OLAP cubes inherently store data in an aggregated format along different
dimensions and hierarchies.
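As a sketch of the summary-table approach, the function below (SQLite is used purely for
illustration, and the table and column names assume the star schema sketched earlier) rebuilds
an aggregate at the (category, year, month) grain after each load. Platforms that support
materialized views would typically use those instead.

import sqlite3

def refresh_sales_by_category_month(con):
    """Rebuild a summary table at the (category, year, month) grain from the base fact table.

    Assumes star-schema tables named sales_fact, product_dim, and time_dim exist;
    a real ETL job would usually refresh this incrementally rather than rebuilding it.
    """
    con.executescript("""
        DROP TABLE IF EXISTS sales_by_category_month;
        CREATE TABLE sales_by_category_month AS
        SELECT p.category,
               t.year,
               t.month,
               SUM(f.sales_amount)  AS total_sales,
               SUM(f.quantity_sold) AS total_quantity
        FROM sales_fact f
        JOIN product_dim p ON f.product_id = p.product_id
        JOIN time_dim    t ON f.time_id    = t.time_id
        GROUP BY p.category, t.year, t.month;
    """)
    con.commit()

Queries at that grain can then read the small summary table instead of scanning the detailed
fact table; an aggregate navigator or query-rewrite layer can make this substitution transparent
to users.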
In conclusion, the multidimensional data model provides a powerful way to structure data for
analytical purposes, and aggregates are a crucial technique for optimizing the performance of
data warehouses by pre-calculating and storing summary data. The effective design and
management of both the data model and aggregates are essential for building a high-performing
and user-friendly business intelligence system.