Data Warehousing Unit 1

Data warehousing is a structured system designed to support business intelligence and analytics by integrating, storing, and managing historical data for decision-making. Organizations develop data warehouses to improve decision-making, enhance business intelligence, gain competitive advantages, and ensure data consistency and quality. The design and implementation of a data warehouse involve several key components, including source systems, ETL processes, data storage, and access tools, all aimed at providing actionable insights for users.


Data Warehousing: Introduction and Needs

Introduction to Data Warehousing

Data warehousing is a subject-oriented, integrated, time-variant, and non-volatile collection of
data used primarily to support business intelligence (BI), analytics, and decision-making
processes. It serves as a central repository for data that has been cleaned, transformed, and
cataloged, making it readily available for querying and analysis.

Let's break down the key characteristics:

 Subject-Oriented: Data in a data warehouse is organized around major subjects of the
business, such as customers, products, sales, and suppliers. This contrasts with
operational systems, which are often application-oriented (e.g., order entry, inventory
management). The focus is on providing information relevant for analysis of these key
business areas.
 Integrated: Data from various heterogeneous source systems (e.g., transactional
databases, CRM systems, ERP systems, flat files) is consolidated and made consistent
within the data warehouse. This involves resolving inconsistencies in naming
conventions, data formats, units of measure, and data structures. Integration provides a
unified view of the organization's data.
 Time-Variant: Data in a data warehouse is recorded with a historical perspective. It
maintains a history of data over time, allowing for trend analysis, comparisons over
different periods, and forecasting. Operational systems typically store only current data.
Time-variance is crucial for understanding how business conditions have evolved.
 Non-Volatile: Once data is in the data warehouse, it is generally not updated or deleted.
New data is added periodically (e.g., daily, weekly, monthly), but historical data is
retained. This ensures a stable and consistent data set for reporting and analysis.

In essence, a data warehouse is not just another database; it's a strategically designed
system that transforms raw operational data into meaningful information for business
users.

Needs for Developing a Data Warehouse

Organizations develop data warehouses to address several critical business needs and overcome
the limitations of relying solely on operational systems for analytical purposes. The primary
needs include:

1. Improved Decision Making:
o Data warehouses provide a consolidated and consistent view of business data,
enabling managers and executives to make more informed and data-driven
decisions.
o By analyzing historical trends and patterns, organizations can gain insights into
past performance, identify opportunities, and anticipate future challenges.
o Features like online analytical processing (OLAP) and data mining tools can be
effectively used on data warehouse data to uncover hidden patterns and
relationships.
2. Enhanced Business Intelligence and Reporting:
o Data warehouses are specifically designed for efficient querying and reporting.
They provide a structured environment optimized for analytical tasks, unlike
operational databases that are optimized for transactional processing.
o Generating complex reports, performing ad-hoc analysis, and creating insightful
visualizations become easier and faster with a well-designed data warehouse.
o It enables the creation of key performance indicators (KPIs) and dashboards to
monitor business performance effectively.
3. Competitive Advantage:
o By gaining deeper insights into customer behavior, market trends, and operational
efficiency, organizations can identify opportunities to differentiate themselves
from competitors.
o Data warehousing supports customer relationship management (CRM) initiatives
by providing a comprehensive view of customer interactions and preferences.
o It can help optimize supply chains, improve marketing campaigns, and identify
new product or service opportunities.
4. Data Integration and Consistency:
o Organizations often have data scattered across various disparate systems. A data
warehouse integrates this data, resolving inconsistencies and providing a single
version of the truth.
o This eliminates the problems associated with data silos and ensures that different
departments are working with the same reliable information.
5. Historical Analysis and Trend Identification:
o Operational systems typically only retain current data, making it difficult to
analyze trends over time. Data warehouses maintain historical data, allowing
organizations to track changes, identify patterns, and forecast future outcomes.
o This historical perspective is crucial for understanding business evolution and
making strategic adjustments.
6. Improved Data Quality:
o The process of building a data warehouse involves data cleaning, transformation,
and integration, which helps to identify and rectify data quality issues present in
the source systems.
o This results in a more reliable and accurate data set for analysis.
7. Separation of Operational and Analytical Processing:
o Operational databases are designed for high-volume transactional processing and
are not optimized for complex analytical queries. Running such queries on
operational systems can slow down their performance and impact day-to-day
operations.
o A data warehouse provides a separate environment for analytical processing,
ensuring that the performance of operational systems is not affected.
8. Regulatory Compliance and Auditing:
o Many industries have regulations requiring the retention and analysis of historical
data. Data warehouses provide a structured and secure environment for storing
this data, facilitating compliance and auditing processes.

In conclusion, the development of a data warehouse is driven by the need for organizations to
transform their raw data into actionable insights, enabling better decision-making, enhancing
business intelligence, gaining a competitive edge, and improving overall business performance.
It provides a foundation for effective analysis and a deeper understanding of the organization's
past, present, and potential future.

Data Warehouse Systems and Components


A data warehouse system is a comprehensive framework that enables the storage, management,
and retrieval of data specifically for analytical and reporting purposes. It comprises several key
components that work together to achieve this goal.

Components of a Data Warehouse System

The architecture of a data warehouse system typically includes the following components:

1. Source Systems: These are the operational systems and external data sources that
provide the raw data for the data warehouse. Examples include:
o Transactional databases (OLTP systems)
o Customer Relationship Management (CRM) systems
o Enterprise Resource Planning (ERP) systems
o Flat files (e.g., CSV, Excel)
o Web logs
o External data feeds
2. ETL (Extraction, Transformation, and Loading) Process: This is the core process
responsible for moving data from the source systems to the data warehouse. It involves
three main stages:
o Extraction: Retrieving data from the various source systems. This may involve
selecting relevant data, handling different data formats, and dealing with data
access methods.
o Transformation: Cleaning, integrating, and transforming the extracted data into
a consistent and usable format for the data warehouse. This includes tasks like:
 Data cleaning (handling missing values, correcting errors, resolving
inconsistencies)
 Data integration (combining data from different sources, resolving naming
conflicts)
 Data transformation (converting data types, standardizing units, creating
derived attributes)
o Loading: Writing the transformed data into the data warehouse. This often
involves initial loading of historical data and subsequent periodic updates.
3. Data Warehouse: This is the central repository where the integrated and cleaned data is
stored. It is typically a relational database management system (RDBMS) optimized for
analytical queries. Common data warehouse database systems include:
o Teradata
o Snowflake
o Amazon Redshift
o Google BigQuery
o Microsoft Azure Synapse Analytics
4. Metadata Repository: This component stores "data about data." It contains information
about the data warehouse structure, data sources, ETL processes, data transformations,
data quality rules, security access, and user information. Metadata is crucial for
understanding, managing, and using the data warehouse effectively. Types of metadata
include:
o Technical Metadata: Describes the data warehouse structure (schemas, tables,
columns, data types), ETL processes, and data source information.
o Business Metadata: Provides business context to the data in the warehouse,
including business rules, data definitions, and user-friendly descriptions.
5. Data Warehouse Access Tools: These are the tools used by business users to interact
with the data in the warehouse for analysis, reporting, and decision-making. Common
types of access tools include:
o Query and Reporting Tools: Allow users to formulate SQL queries or use
graphical interfaces to generate reports and perform ad-hoc analysis.
o OLAP (Online Analytical Processing) Tools: Enable multi-dimensional
analysis of data, allowing users to slice and dice data to gain insights from
different perspectives.
o Data Mining Tools: Used to discover hidden patterns, trends, and relationships in
the data.
o Business Intelligence (BI) Platforms: Integrated suites of tools that provide a
comprehensive environment for data analysis, visualization, and reporting (e.g.,
Tableau, Power BI, Qlik Sense).
6. Data Marts (Optional): These are subject-oriented subsets of the data warehouse
designed to meet the specific analytical needs of particular business units or user groups
(e.g., marketing data mart, sales data mart). Data marts can improve query performance
and make data access easier for specific teams. They can be:
o Dependent Data Marts: Created directly from the central data warehouse.
o Independent Data Marts: Developed independently of the central data
warehouse, often from source systems. This approach can lead to data
inconsistencies and is generally less preferred.
7. Data Governance Framework: This encompasses the policies, procedures, and
responsibilities for ensuring the quality, security, and integrity of the data in the data
warehouse. It addresses aspects like data ownership, data standards, data access control,
and data auditing.
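The three ETL stages described above can be sketched end to end in a few lines of Python. This is a minimal illustration, not a production pipeline: the source rows, table names, and cleaning rules are all hypothetical, and SQLite stands in for a warehouse database.

```python
import sqlite3

# Hypothetical raw records from two source systems, with the inconsistent
# naming and units that the Transformation stage must reconcile.
crm_rows = [{"cust": "Alice", "country": "US"}, {"cust": "Bob", "country": "USA"}]
erp_rows = [{"customer_name": "alice", "amount_cents": 1250}]

def extract():
    # Extraction: pull rows from each source (here, in-memory stand-ins).
    return crm_rows, erp_rows

def transform(crm, erp):
    # Transformation: standardize names, unify country codes, convert units.
    customers = {}
    for r in crm:
        country = "US" if r["country"] in ("US", "USA") else r["country"]
        customers[r["cust"].title()] = country
    sales = [{"customer": r["customer_name"].title(),
              "amount": r["amount_cents"] / 100} for r in erp]
    return customers, sales

def load(customers, sales, conn):
    # Loading: write the cleaned, integrated data into warehouse tables.
    conn.execute("CREATE TABLE dim_customer (name TEXT PRIMARY KEY, country TEXT)")
    conn.execute("CREATE TABLE fact_sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", list(customers.items()))
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                     [(s["customer"], s["amount"]) for s in sales])
    conn.commit()

conn = sqlite3.connect(":memory:")
load(*transform(*extract()), conn)
print(conn.execute("SELECT customer, amount FROM fact_sales").fetchall())
# -> [('Alice', 12.5)]
```

In a real system each stage would be far richer (change data capture, surrogate keys, error handling), but the extract/transform/load separation is the same.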

Design of a Data Warehouse


Designing an effective data warehouse is a crucial and iterative process. Several methodologies
and approaches exist, but the core principles remain consistent. The key steps involved in data
warehouse design include:

1. Planning and Requirements Gathering:
o Define Business Objectives: Understand the business goals and the analytical
needs that the data warehouse aims to address.
o Identify Information Needs: Determine the specific information required by
business users for decision-making.
o Define Scope: Clearly define the boundaries of the data warehouse project,
including the subject areas to be covered and the initial data sources.
o Gather User Requirements: Interview stakeholders and end-users to understand
their reporting and analytical requirements.
o Conduct Feasibility Study: Assess the technical and economic feasibility of
building the data warehouse.
2. Conceptual Modeling:
o Develop a high-level, business-oriented model of the data warehouse.
o Identify key business entities (e.g., customers, products, orders) and their
relationships.
o This model focuses on the business meaning of the data rather than the technical
details of implementation.
3. Logical Modeling:
o Translate the conceptual model into a logical schema that describes the structure
of the data warehouse.
o Dimensional Modeling: This is the most common approach for data warehouse
logical design. It involves identifying:
 Facts: Numerical measures or metrics that represent business events (e.g.,
sales amount, quantity sold). Fact tables typically contain foreign keys
referencing dimension tables.
 Dimensions: Descriptive attributes that provide context to the facts (e.g.,
customer demographics, product details, time periods, locations).
Dimension tables contain attributes that are used for filtering, grouping,
and labeling reports.
o Common dimensional models include:
 Star Schema: A central fact table surrounded by several dimension tables,
resembling a star.
 Snowflake Schema: Dimension tables are normalized, breaking them
down into multiple related tables, resembling a snowflake. While it
reduces data redundancy, it can increase query complexity.
 Galaxy Schema (Fact Constellation): Multiple fact tables share some
common dimension tables.
4. Physical Modeling:
o Define the physical implementation of the data warehouse on a specific database
management system.
o This involves decisions about:
 Table structures (including data types, constraints, and indexing strategies)
 Partitioning strategies to improve query performance and manage large
volumes of data.
 Storage structures and optimization techniques.
 Security measures and access controls.
5. ETL System Design:
o Design the processes for extracting, transforming, and loading data from source
systems into the data warehouse.
o This includes:
 Identifying data sources and extraction methods.
 Defining data cleaning and transformation rules.
 Designing the ETL workflow and scheduling.
 Selecting appropriate ETL tools or developing custom scripts.
 Implementing data quality checks and error handling mechanisms.
6. Data Warehouse Implementation:
o Build the data warehouse based on the physical model and implement the ETL
processes.
o This involves database creation, table creation, index creation, and the
development and testing of ETL workflows.
7. Deployment and Testing:
o Deploy the data warehouse and ETL system in the production environment.
o Conduct thorough testing of the entire system, including data loading, query
performance, and the functionality of access tools.
o Gather user feedback and make necessary adjustments.
8. Maintenance and Growth:
o Continuously monitor the performance and data quality of the data warehouse.
o Perform regular maintenance tasks, such as data backups, performance tuning,
and schema updates.
o Adapt the data warehouse to evolving business needs and incorporate new data
sources and requirements.

The design of a data warehouse is an iterative process, and the steps may overlap. A well-
designed data warehouse is crucial for providing timely, accurate, and insightful information that
supports effective business decision-making.

Dimensions and Measures in Data Warehousing


In the context of data warehousing and dimensional modeling, dimensions and measures are
fundamental concepts that form the core of fact tables and dimension tables. They work together
to provide a structured way to analyze business data.

Measures (Facts)

 Definition: Measures, often referred to as facts, are the numerical values that represent
the business events or transactions you want to analyze. They are typically quantitative
and can be aggregated (summed, averaged, counted, etc.).
 Characteristics:
o Numeric: Measures are always numeric.
o Aggregatable: The primary purpose of measures is to be aggregated across
different dimensions to gain insights.
o Quantitative: They represent quantities, amounts, or counts.
o Stored in Fact Tables: Measures reside in the fact tables of a dimensional model.
 Examples:
o Sales Amount: The total revenue generated from a sale.
o Quantity Sold: The number of units of a product sold.
o Profit: The difference between revenue and cost.
o Number of Orders: The count of individual orders placed.
o Inventory Level: The current stock of a product.
o Website Visits: The number of times a webpage was accessed.
o Temperature: A recorded temperature reading.

Dimensions

 Definition: Dimensions provide the context for the measures. They are descriptive
attributes that help you to categorize, group, and filter the measures for analysis.
Dimensions answer the questions of "who," "what," "where," "when," and "why" about
the business events.
 Characteristics:
o Descriptive: Dimensions contain textual or categorical data that describes the
business context.
o Qualitative: They represent qualities or characteristics.
o Used for Filtering and Grouping: Dimensions are used to slice and dice the
measures in reports and analytical queries.
o Stored in Dimension Tables: Dimension attributes are stored in dimension
tables, which are typically linked to the fact table through foreign keys.
 Examples of Common Dimensions:
o Time Dimension: Date, Month, Year, Quarter, Day of Week.
o Product Dimension: Product Name, Product Category, Brand, Color, Size.
o Customer Dimension: Customer Name, Customer ID, Age Group, Gender,
Location.
o Location Dimension: City, State, Country, Region.
o Organization Dimension: Department, Business Unit.
o Salesperson Dimension: Salesperson Name, Sales Team, Sales Region.

Relationship between Dimensions and Measures

 Fact Tables: Contain measures and foreign keys that reference the primary keys of the
dimension tables. Each row in a fact table represents a specific business event at the
intersection of the associated dimension members.
 Dimension Tables: Contain the descriptive attributes of each dimension. Each row in a
dimension table represents a unique member of that dimension.
 Analysis: By joining fact tables with dimension tables, you can analyze the measures
across different dimensional attributes. For example:
o "What was the total sales amount (measure) for each product category
(dimension) in each month (dimension)?"
o "How many customers (measure: count of distinct customer IDs) are in each
age group (dimension) and each region (dimension)?"
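The first sample question above maps directly onto an aggregation query: the measure is summed, while the dimension attributes drive the grouping. The sketch below uses SQLite with hypothetical table and column names (dim_product, fact_sales, sales_amount).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical star-schema fragment: one fact table, one dimension.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, month TEXT, sales_amount REAL);
    INSERT INTO dim_product VALUES (1, 'Electronics'), (2, 'Clothing');
    INSERT INTO fact_sales VALUES (1, '2024-01', 100.0), (1, '2024-01', 50.0),
                                  (2, '2024-02', 75.0);
""")

# The measure (sales_amount) is aggregated; the dimensions (category, month)
# supply the grouping context.
rows = conn.execute("""
    SELECT p.category, f.month, SUM(f.sales_amount)
    FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.category, f.month
    ORDER BY p.category
""").fetchall()
print(rows)
# -> [('Clothing', '2024-02', 75.0), ('Electronics', '2024-01', 150.0)]
```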

Key Differences Summarized

Feature             Measure (Fact)                   Dimension
------------------  -------------------------------  --------------------------------
Type of Data        Numeric, quantitative            Descriptive, qualitative
Purpose             To be analyzed and aggregated    To provide context for analysis
Location            Primarily in fact tables         Primarily in dimension tables
Operation           Aggregated (sum, avg, count)     Used for filtering and grouping
Question Answered   How much? How many?              Who? What? Where? When? Why?

Understanding the distinction and relationship between dimensions and measures is crucial for
designing effective data warehouses and creating meaningful business intelligence reports and
analyses. They form the foundation of dimensional modeling, which is widely used for
organizing data for analytical purposes.

Data Marts: Dependent, Independent, and Distributed


Data marts are subject-oriented subsets of a data warehouse focused on providing data for a
specific business unit, department, or user group. They offer a more tailored and manageable
view of the data, potentially improving query performance and user access for specific analytical
needs. There are three main types of data marts: dependent, independent, and distributed.

1. Dependent Data Marts

 Definition: A dependent data mart is created directly from an existing enterprise data
warehouse. It relies on the central data warehouse as its single source of truth.
 Characteristics:
o Single Source of Data: Data is sourced exclusively from the enterprise data
warehouse.
o Consistency: Ensures data consistency across different data marts as they all
draw from the same integrated and cleaned data.
o Reduced Data Redundancy: Data is stored centrally, minimizing redundancy in
the data mart.
o Simplified ETL: The ETL process is primarily focused on building and
maintaining the central data warehouse. Data marts require a simpler extraction
and transformation process from the warehouse.
o Easier Integration: Integration between different dependent data marts is
straightforward as they share a common foundation.
o Scalability: The scalability of dependent data marts is tied to the scalability of the
central data warehouse.
 Architecture:

         Source Systems
               |
               V
   Enterprise Data Warehouse
               |
       ----------------
       |              |
       V              V
  Data Mart A    Data Mart B
 (e.g., Sales) (e.g., Marketing)

 Advantages:
o Data Consistency: The most significant advantage is data consistency across the
organization.
o Reduced Development Time: Building dependent data marts is generally faster
as the core data integration and cleaning are already done in the central
warehouse.
o Simplified Management: Easier to manage and maintain as the data origin is
centralized.
o Improved Data Governance: Benefits from the data governance policies
established for the enterprise data warehouse.
 Disadvantages:
o Reliance on Central Warehouse: The availability and performance of dependent
data marts are tied to the central data warehouse. Any issues with the central
warehouse can impact the data marts.
o Potential Bottleneck: If the central data warehouse is not well-designed or
performs poorly, it can become a bottleneck for accessing data in the data marts.
o Less Flexibility: Less flexibility in terms of data modeling and integration
specific to the needs of a particular business unit if the central warehouse design
doesn't fully accommodate them.

2. Independent Data Marts

 Definition: An independent data mart is a standalone data warehouse focused on a
specific subject or business unit, built directly from the operational systems or external
data sources without relying on a central data warehouse.
 Characteristics:
o Direct Data Sources: Data is extracted, transformed, and loaded directly from
source systems relevant to the specific business need.
o Subject-Oriented: Focused on a particular business area (e.g., sales, finance,
marketing).
o Potential for Data Inconsistency: Data may not be consistent with other data
marts or the rest of the organization if different integration and cleaning rules are
applied.
o Complex ETL: Each independent data mart requires its own ETL process to
extract, clean, and transform data from its specific sources.
o Data Redundancy: Data related to the same entities might be duplicated across
different independent data marts.
o Difficult Integration: Integrating data across different independent data marts
can be challenging due to potential inconsistencies in data structures, definitions,
and quality.
 Architecture:

Source System A --> ETL A --> Data Mart A (e.g., Sales)
Source System B --> ETL B --> Data Mart B (e.g., Marketing)
External Data C --> ETL C --> Data Mart C (e.g., Finance)

 Advantages:
o Faster Implementation (Initially): Can be implemented quickly for specific
needs without waiting for the development of a large enterprise data warehouse.
o Tailored to Specific Needs: Can be designed and optimized to meet the exact
analytical requirements of a particular business unit.
o Flexibility: Offers greater flexibility in terms of data modeling and source system
integration for a specific area.
 Disadvantages:
o Data Inconsistency: The biggest drawback is the potential for inconsistent data
across different data marts and the rest of the organization.
o Data Redundancy: Increased data redundancy leads to higher storage costs and
potential inconsistencies.
o Complex Management: Managing multiple independent ETL processes and data
marts can be complex and resource-intensive.
o Difficult Integration: Integrating data across different independent data marts for
a holistic view of the business is challenging.
o Lack of Enterprise View: Does not provide a unified view of the organization's
data.

3. Distributed Data Marts

 Definition: A distributed data mart architecture involves having multiple data marts that
are interconnected and can share data. This approach attempts to combine some of the
benefits of both dependent and independent data marts.
 Characteristics:
o Interconnected: Data marts can communicate and exchange data.
o Subject-Oriented: Each data mart focuses on a specific business area.
o Potential for Controlled Redundancy: Some data might be replicated across
data marts for performance or accessibility reasons, but ideally, it's managed and
controlled.
o Complex Architecture: Requires careful design and management to ensure data
consistency and efficient data sharing.
o Federated Approach: Can be seen as a federated approach where each data mart
is autonomous but participates in a larger data sharing framework.
 Architecture (Conceptual):

            Source Systems
                  |
                  V
      (Optional) Staging Area
                  |
   -----------------------------------
   |                |                |
   V                V                V
Data Mart A <--> Data Mart B <--> Data Mart C
(e.g., Sales)  (e.g., Marketing)  (e.g., Finance)
      ^                ^                ^
      |                |                |
      (Potentially shared/derived data)

 Advantages:
o Flexibility and Scalability: Allows for flexibility in designing data marts for
specific needs while still enabling data sharing and integration.
o Improved Performance: Data marts can be optimized for specific user groups,
potentially improving query performance.
o Hybrid Approach: Can leverage existing independent data marts while gradually
integrating them or building new dependent data marts.
 Disadvantages:
o Complexity: Designing and managing a distributed data mart environment is
more complex than either a purely dependent or independent approach.
o Data Consistency Challenges: Ensuring data consistency across interconnected
data marts requires careful planning and implementation of data sharing and
synchronization mechanisms.
o Governance Issues: Establishing and enforcing data governance policies across
distributed data marts can be challenging.

In Summary:

 Dependent Data Marts: Best when data consistency and a single version of the truth are
paramount, and a robust enterprise data warehouse exists.
 Independent Data Marts: May be considered for very specific, isolated needs or as a
starting point before building a full data warehouse, but they often lead to data
inconsistencies and management challenges in the long run.
 Distributed Data Marts: Offer a compromise, providing flexibility while attempting to
maintain some level of integration and data sharing, but they require careful design and
management to avoid complexity and consistency issues.

The choice of data mart architecture depends on the organization's specific needs, existing
infrastructure, data governance policies, and long-term data warehousing strategy. In most
modern data warehousing environments, a dependent data mart approach built upon a well-designed
enterprise data warehouse is generally preferred for its benefits in terms of data
consistency and governance.

Conceptual Modeling of Data Warehouses

Conceptual modeling is the initial high-level stage in data warehouse design. It focuses on
understanding and representing the business requirements and the key business entities and their
relationships, without delving into the technical details of database implementation. The goal is
to create a business-oriented model that serves as a blueprint for the subsequent logical and
physical design phases.

Key aspects of conceptual modeling for data warehouses include:

 Identifying Business Processes: Understanding the core business activities that need to
be analyzed (e.g., sales, orders, shipments, customer interactions).
 Identifying Key Business Entities: Determining the central subjects of analysis related
to these processes (e.g., Customers, Products, Time, Locations, Salespeople).
 Defining Relationships: Understanding how these entities interact within the business
processes.
 Identifying Key Measures: Determining the quantitative data points that need to be
tracked and analyzed (e.g., Sales Amount, Quantity Sold, Profit, Number of Visits).
 Defining Dimensions: Identifying the contextual attributes that provide different
perspectives for analyzing the measures (e.g., Product Category, Customer Segment,
Sales Region, Month).

The output of conceptual modeling is often a high-level diagram or description that
communicates the scope and key elements of the data warehouse to business stakeholders. It
serves as a foundation for the more detailed logical modeling phase, where specific data
structures and relationships are defined.

Logical Modeling: Star Schema, Snowflake Schema, and Fact Constellations

Logical modeling translates the conceptual model into a database schema that can be
implemented in a data warehouse system. The most common logical modeling technique for data
warehouses is dimensional modeling, which focuses on organizing data into facts and
dimensions. Three primary types of dimensional schemas are:

1. Star Schema

 Structure: The star schema is characterized by a central fact table surrounded by several
dimension tables, resembling a star.
 Fact Table: Contains the primary measures (quantitative data) and foreign keys that
reference the primary keys of the dimension tables. It represents the relationships
between the dimensions for a specific business event.
 Dimension Tables: Contain the descriptive attributes that provide context to the facts.
They are typically denormalized, meaning they may contain redundant data to simplify
querying and improve performance. Each dimension table has a primary key that is
referenced by one or more foreign keys in the fact table.
 Simplicity: The star schema is relatively simple to understand and query, making it
popular for OLAP applications.
 Query Performance: The denormalized dimension tables often lead to fewer joins in
queries, which can improve performance.

Example:

Sales Fact
(Sales_ID PK, Product_ID FK, Customer_ID FK, Time_ID FK, Location_ID FK,
 Sales_Amount, Quantity_Sold, ...)
    |-- Product_ID  --> Product Dimension  (Product_ID, Product_Name, Category, Brand)
    |-- Customer_ID --> Customer Dimension (Customer_ID, Customer_Name, Age_Group, Gender)
    |-- Time_ID     --> Time Dimension     (Time_ID, Year, Quarter, Month, Day)
    |-- Location_ID --> Location Dimension (Location_ID, City, State, Country)
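As a sketch, a star schema along these lines could be declared as follows (SQLite, with assumed snake_case table and column names). The key property is that every dimension is reachable from the fact table in a single join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized dimension tables: one table per dimension.
    CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, year INTEGER,
                               quarter INTEGER, month INTEGER, day INTEGER);
    CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product_name TEXT,
                               category TEXT, brand TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT,
                               age_group TEXT, gender TEXT);
    CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT,
                               state TEXT, country TEXT);
    -- Central fact table: measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        sales_id      INTEGER PRIMARY KEY,
        product_id    INTEGER REFERENCES dim_product(product_id),
        customer_id   INTEGER REFERENCES dim_customer(customer_id),
        time_id       INTEGER REFERENCES dim_time(time_id),
        location_id   INTEGER REFERENCES dim_location(location_id),
        sales_amount  REAL,
        quantity_sold INTEGER
    );
""")

# A typical slice-and-dice query: one join per dimension used.
query = """
    SELECT t.year, p.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_time t    ON f.time_id = t.time_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY t.year, p.category
"""
print(conn.execute(query).fetchall())  # empty until facts are loaded
```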

2. Snowflake Schema

 Structure: The snowflake schema is an extension of the star schema where the
dimension tables are normalized. This means that the dimension tables are further
broken down into multiple related tables to eliminate redundancy.
 Normalization: Dimension attributes are organized into hierarchical structures, creating
a snowflake-like shape when diagrammed. For example, a Location dimension might be
split into Location, City, State, and Country tables.
 Reduced Data Redundancy: Normalization reduces data redundancy and can improve
data integrity.
 Increased Query Complexity: Querying a snowflake schema often requires more joins
compared to a star schema, which can potentially impact query performance.
 Data Integrity: The normalized structure can enforce data integrity more effectively.

Example (extending the Star Schema):


                   Time Dimension
        (Time_ID, Year, Quarter, Month, Day)
                         |
                         | FK
                -------------------
                |    Sales Fact    |
                |------------------|
                | Sales_ID (PK)    |
                | Product_ID (FK)  |
                | Customer_ID (FK) |
                | Time_ID (FK)     |
                | Location_ID (FK) |
                | Sales_Amount     |
                | Quantity_Sold    |
                | ...              |
                -------------------

Normalized dimension chains (each arrow is a foreign key to the next level):

Product_ID (FK)  --> Product Dimension (Product_ID, Product_Name, Category_ID)
                       --> Category Dimension (Category_ID, Category_Name)

Customer_ID (FK) --> Customer Dimension (Customer_ID, Customer_Name, Demographic_ID)
                       --> Demographic Dimension (Demographic_ID, Age_Group, Gender)

Location_ID (FK) --> Location Dimension (Location_ID, City_ID)
                       --> City Dimension (City_ID, City_Name, State_ID)
                             --> State Dimension (State_ID, State_Name, Country_ID)
                                   --> Country Dimension (Country_ID, Country_Name)
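The cost of this normalization shows up at query time. A minimal sketch using Python's sqlite3 module (table names and sample rows are invented for the example) shows that summarizing sales by state now requires a chain of joins through the location hierarchy, where a star schema would join a single denormalized location table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized location hierarchy: each level is its own table.
cur.executescript("""
CREATE TABLE country_dim (country_id INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE state_dim   (state_id INTEGER PRIMARY KEY, state_name TEXT,
                          country_id INTEGER REFERENCES country_dim(country_id));
CREATE TABLE city_dim    (city_id INTEGER PRIMARY KEY, city_name TEXT,
                          state_id INTEGER REFERENCES state_dim(state_id));
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY,
                           city_id INTEGER REFERENCES city_dim(city_id));
CREATE TABLE sales_fact  (sales_id INTEGER PRIMARY KEY,
                          location_id INTEGER REFERENCES location_dim(location_id),
                          sales_amount REAL);
INSERT INTO country_dim VALUES (1, 'USA');
INSERT INTO state_dim VALUES (1, 'California', 1);
INSERT INTO city_dim VALUES (1, 'San Jose', 1), (2, 'Los Angeles', 1);
INSERT INTO location_dim VALUES (1, 1), (2, 2);
INSERT INTO sales_fact VALUES (1, 1, 500.0), (2, 2, 250.0), (3, 1, 100.0);
""")

# Sales by state: three joins just to resolve the location hierarchy.
rows = cur.execute("""
    SELECT s.state_name, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN location_dim l ON f.location_id = l.location_id
    JOIN city_dim c     ON l.city_id = c.city_id
    JOIN state_dim s    ON c.state_id = s.state_id
    GROUP BY s.state_name""").fetchall()
print(rows)  # [('California', 850.0)]
```

In exchange, each city, state, and country name is stored exactly once, which is the data-integrity benefit normalization buys.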

3. Fact Constellations (Galaxy Schema)

 Structure: A fact constellation, also known as a galaxy schema, consists of multiple fact
tables that share some common dimension tables. This model is used when there are
multiple business processes with shared dimensions.
 Multiple Fact Tables: Each fact table represents a different business process (e.g., sales,
shipping, billing).
 Shared Dimensions: Dimension tables like Time, Product, and Customer can be shared
across these multiple fact tables, ensuring consistency in the dimensional data.
 Complexity: Fact constellations can be more complex to design and manage than single
fact table schemas.
 Comprehensive View: This model allows for a more comprehensive view of the
business by integrating data from different but related business processes.

Example:

                        Time Dimension
             (Time_ID, Year, Quarter, Month, Day)
                    ^                 ^
                    |                 |
        -------------------   --------------------
        |    Sales Fact    |  |   Shipping Fact   |
        |------------------|  |-------------------|
        | Sales_ID (PK)    |  | Shipment_ID (PK)  |
        | Product_ID (FK)  |  | Product_ID (FK)   |
        | Customer_ID (FK) |  | Customer_ID (FK)  |
        | Time_ID (FK)     |  | Time_ID (FK)      |
        | Location_ID (FK) |  | Location_ID (FK)  |
        | Sales_Amount     |  | Shipment_Date     |
        | Quantity_Sold    |  | Delivery_Date     |
        | ...              |  | Shipping_Cost     |
        -------------------   --------------------
                    |                 |
                    v                 v
        Shared dimensions referenced by both fact tables:
          Product Dimension  (Product_ID, Product_Name, Category, Brand)
          Customer Dimension (Customer_ID, Customer_Name, Age_Group, Gender)
          Location Dimension (Location_ID, City, State, Country)
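Because both fact tables reference the same product dimension, measures from the two business processes can be compared side by side (often called drilling across). A minimal sqlite3 sketch, with invented table names and sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two fact tables (sales, shipping) sharing one conformed product dimension.
cur.executescript("""
CREATE TABLE product_dim  (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales_fact   (product_id INTEGER, sales_amount REAL);
CREATE TABLE shipping_fact (product_id INTEGER, shipping_cost REAL);
INSERT INTO product_dim VALUES (1, 'Laptop'), (2, 'Desk');
INSERT INTO sales_fact VALUES (1, 1200.0), (1, 600.0), (2, 300.0);
INSERT INTO shipping_fact VALUES (1, 40.0), (2, 25.0);
""")

# One query combines measures from both processes per shared dimension member.
rows = cur.execute("""
    SELECT p.product_name,
           (SELECT SUM(s.sales_amount)   FROM sales_fact s
             WHERE s.product_id = p.product_id) AS total_sales,
           (SELECT SUM(sh.shipping_cost) FROM shipping_fact sh
             WHERE sh.product_id = p.product_id) AS total_shipping
    FROM product_dim p
    ORDER BY p.product_id""").fetchall()
print(rows)  # [('Laptop', 1800.0, 40.0), ('Desk', 300.0, 25.0)]
```

If each fact table had its own private copy of the product dimension, this comparison would require reconciling the two copies first; the shared (conformed) dimension is what makes the constellation consistent.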

Choosing the Right Schema:

The choice between star, snowflake, and fact constellation schemas depends on various factors,
including:

 Query Performance Requirements: Star schemas generally offer better query
performance due to fewer joins.
 Data Redundancy Tolerance: Star schemas have higher data redundancy in dimension
tables, while snowflake schemas minimize it.
 Complexity of the Business Domain: Fact constellations are suitable for complex
scenarios with multiple interrelated business processes.
 Ease of Understanding and Implementation: Star schemas are the simplest to
understand and implement.
 Data Integrity Needs: Snowflake schemas can offer better data integrity due to
normalization.

In practice, a star schema or a variation of it is often the preferred choice for its balance of
simplicity and performance. Snowflake schemas might be used for specific dimensions where
data integrity and reduced redundancy are critical, even though this slightly increases query
complexity.
Fact constellations are employed when modeling multiple, interconnected business processes
that share common dimensions.

Multidimensional Data Model and Aggregates

Multidimensional Data Model


The multidimensional data model is a way of organizing data in a database or data warehouse
to facilitate Online Analytical Processing (OLAP). It structures data in the form of a data cube,
which allows for analysis from multiple perspectives (dimensions).

Key Concepts:

 Dimensions: These are the categories that define the perspectives for analyzing the data.
They represent the "who," "what," "where," and "when" of the data. Examples include:
o Time (Year, Quarter, Month, Day)
o Product (Category, Sub-Category, Product Name)
o Location (Country, State, City)
o Customer (Segment, Region, Individual Customer)
 Measures (Facts): These are the numerical values that you want to analyze. They
represent the "how much" or "how many." Measures are typically quantitative and can be
aggregated. Examples include:
o Sales Amount
o Quantity Sold
o Profit
o Number of Orders
o Website Visits
 Cells: Each cell in the multidimensional cube represents a specific combination of
dimension members and contains the value of the measure(s) for that combination.
 Hierarchies: Dimensions often have hierarchical relationships. For example, the Time
dimension can have a hierarchy of Year -> Quarter -> Month -> Day. These hierarchies
allow for drilling down (going from a higher level to a lower level of detail) and rolling
up (aggregating data from a lower level to a higher level).
 Levels: Each level in a hierarchy represents a different granularity of the dimension (e.g.,
Year level, Month level).
 Attributes: Dimensions can have attributes that provide further descriptive information
about the dimension members (e.g., Product Name is an attribute of the Product
dimension).

OLAP Operations on Multidimensional Data:

The multidimensional model supports various OLAP operations that enable users to analyze data
from different angles:

 Roll-up (Aggregation): Summarizing data along a dimension hierarchy (e.g.,
aggregating daily sales to monthly sales).
 Drill-down: Navigating from a higher level of a hierarchy to a more detailed level (e.g.,
from yearly sales to quarterly sales).
 Slice: Selecting a subset of the data cube by fixing one or more dimensions to a single
value (e.g., showing sales data for a specific year).
 Dice: Selecting a sub-cube by specifying a range of values for one or more dimensions
(e.g., showing sales data for a specific set of products in a specific time period).
 Pivot (Rotate): Changing the orientation of the data cube to view it from different
perspectives (e.g., swapping rows and columns in a report).
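The roll-up, drill-down, slice, and dice operations above can be sketched in plain Python by treating the cube as a mapping from dimension-member tuples to a measure. The dimension values and figures below are invented for the example:

```python
from collections import defaultdict

# A tiny "cube": cells keyed by (year, month, product, city) -> sales amount.
cells = {
    (2024, 1, "Laptop", "San Jose"): 1200.0,
    (2024, 1, "Desk",   "San Jose"): 300.0,
    (2024, 2, "Laptop", "Austin"):   600.0,
    (2025, 1, "Laptop", "Austin"):   900.0,
}

# Roll-up: aggregate months up to the Year level of the Time hierarchy.
by_year = defaultdict(float)
for (year, month, product, city), amount in cells.items():
    by_year[year] += amount
print(dict(by_year))  # {2024: 2100.0, 2025: 900.0}

# Drill-down: the reverse -- view 2024 at the finer Month granularity.
by_month_2024 = defaultdict(float)
for (year, month, product, city), amount in cells.items():
    if year == 2024:
        by_month_2024[month] += amount
print(dict(by_month_2024))  # {1: 1500.0, 2: 600.0}

# Slice: fix one dimension (product = "Laptop") to get a sub-cube.
laptop_slice = {k: v for k, v in cells.items() if k[2] == "Laptop"}

# Dice: restrict several dimensions to sets/ranges of values.
dice = {k: v for k, v in cells.items()
        if k[0] == 2024 and k[3] in {"San Jose", "Austin"}}
print(sum(dice.values()))  # 2100.0
```

A real OLAP engine performs these operations over pre-built storage structures rather than by scanning every cell, but the semantics are the same.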

Aggregates in Data Warehousing


Aggregates are pre-calculated summary data derived from the base fact tables in a data
warehouse. They store data at a higher level of granularity along one or more dimensions. The
primary purpose of aggregates is to improve query performance by reducing the amount of
data that needs to be scanned and processed to answer common analytical queries.

Why Use Aggregates?

 Faster Query Response Times: By pre-calculating summaries, queries that require
aggregations (e.g., total sales by month, average profit by product category) can be
answered much faster as the results are already computed and stored.
 Reduced System Load: Less processing is required at query time, which reduces the
load on the data warehouse system and allows it to handle more concurrent users and
queries.
 Improved User Experience: Faster query response times lead to a better and more
interactive experience for business users performing analysis and generating reports.

Types of Aggregates:

Aggregates can be created at different levels of granularity for various dimensions and
combinations of dimensions. For example:

 Aggregating by Time: Total sales by year, quarter, month.
 Aggregating by Geography: Total sales by country, state, city.
 Aggregating by Product: Total sales by product category, sub-category.
 Aggregating by Combinations: Total sales by product category and month, average
profit by region and quarter.

Design Considerations for Aggregates:

 Identify Common Queries: Analyze query patterns and user needs to determine which
aggregations will provide the most significant performance benefits.
 Balance Performance and Storage: Creating too many aggregates can improve
performance but also significantly increase storage space and the time required for ETL
processes to maintain them. A balance needs to be struck based on query frequency and
storage capacity.
 Aggregate Navigation: The data warehouse system should have a mechanism (often
called an aggregate navigator or query rewrite engine) to automatically identify the most
appropriate aggregate table to use when a query is submitted, making the use of
aggregates transparent to the end-user.
 Maintenance: Aggregates need to be updated whenever new data is loaded into the base
fact tables, which adds to the complexity and processing time of the ETL process.

Implementation of Aggregates:

Aggregates can be implemented in various ways:

 Materialized Views: Many database systems support materialized views, which are pre-
computed and stored results of SQL queries. These can be used to create and manage
aggregates.
 Summary Tables: Separate tables can be explicitly created in the data warehouse to
store aggregated data.
 OLAP Cubes: OLAP cubes inherently store data in an aggregated format along different
dimensions and hierarchies.
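The summary-table approach can be sketched with Python's sqlite3 module (SQLite has no native materialized views, so an explicit summary table stands in for one; table names and sample rows are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Base fact table, here with a denormalized month column for brevity.
cur.executescript("""
CREATE TABLE sales_fact (sales_id INTEGER PRIMARY KEY, month INTEGER,
                         product TEXT, sales_amount REAL);
INSERT INTO sales_fact VALUES
  (1, 1, 'Laptop', 1200.0), (2, 1, 'Desk', 300.0),
  (3, 2, 'Laptop', 600.0),  (4, 2, 'Desk', 150.0);
""")

# Build the aggregate once, during ETL, as a summary table.
cur.execute("""CREATE TABLE monthly_sales_agg AS
               SELECT month, SUM(sales_amount) AS total_sales
               FROM sales_fact
               GROUP BY month""")

# A common query now reads two pre-summed rows instead of
# scanning and aggregating all four base fact rows.
rows = cur.execute(
    "SELECT month, total_sales FROM monthly_sales_agg ORDER BY month").fetchall()
print(rows)  # [(1, 1500.0), (2, 750.0)]
```

The trade-off noted above applies here: the summary table must be rebuilt or incrementally updated every time new rows land in sales_fact, which is work the ETL process takes on in exchange for faster queries.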

In conclusion, the multidimensional data model provides a powerful way to structure data for
analytical purposes, and aggregates are a crucial technique for optimizing the performance of
data warehouses by pre-calculating and storing summary data. The effective design and
management of both the data model and aggregates are essential for building a high-performing
and user-friendly business intelligence system.
