Unit 2-DATA WAREHOUSE
UNIT 2
Data Warehouse
• A Data Warehouse (DW) is a relational database that is designed for query
and analysis rather than transaction processing.
• It includes historical data derived from transaction data from one or
more sources.
• A Data Warehouse provides integrated, enterprise-wide, historical data and
focuses on providing support for decision-makers for data modeling and
analysis.
• A Data Warehouse can be viewed as a data system with the following
attributes:
• It is a database designed for investigative tasks, using data from various applications.
• It supports a relatively small number of clients with relatively long interactions.
• It includes current and historical data to provide a historical perspective of information.
• Its usage is read-intensive.
• It contains a few large tables.
• Data warehousing is a technology that enables businesses to store,
manage, and analyze large volumes of data from various sources in a
centralized repository. The primary goal of data warehousing is to
provide a comprehensive and integrated view of an organization's
data to support informed decision-making.
Advantages
• It provides a central repository for critical data, making it easy for business
users to access information from various sources.
• By providing a consolidated view of data from different sources, data
warehouses enable organizations to make informed decisions based on
accurate and consistent data.
• Integrates multiple data sources to reduce stress on the production system and
reduces the total turnaround time for analysis and reporting.
• Data warehouses provide historical data that can be used to identify trends and
patterns over time, leading to better decision-making and planning.
• Restructures and integrates data to make it easier for users to use for reporting
and analysis.
• Saves time by allowing users to access critical data from multiple sources in a
single place.
• Data warehouses can easily scale to meet the needs of growing organizations,
allowing them to store and analyze large volumes of data.
Disadvantages
• Implementing and maintaining a data warehouse can be expensive,
including hardware, software, and personnel costs.
• Not suitable for unstructured data.
• Not suitable for real-time or near-real-time data processing.
• Integrating data from multiple sources into a single data warehouse
can be complex and time-consuming.
• Data in the warehouse may become outdated quickly.
• Changes in data types and ranges, data source schema, indexes, and
queries can be challenging to implement.
Difference between Database & Data Warehouses
Aspect | Database | Data Warehouse
Purpose | Primarily for transactional operations | Primarily for analytical operations
Data Type | Handles structured data | Handles structured and unstructured data
Schema | Generally follows a normalized schema | Often follows a denormalized schema
Data Volume | Usually handles smaller data volumes | Handles large volumes of historical data
Performance | Optimized for read and write operations | Optimized for complex queries and reporting
Data Freshness | Emphasizes real-time data | Emphasizes historical and periodic data
Query Complexity | Supports simpler, real-time queries | Supports complex, analytical queries
Data Transformations | Limited emphasis on data transformations | Involves significant data transformations
Usage | Used for day-to-day operations | Used for decision-making and analysis
Data Model | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing)
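The contrast between the two columns can be sketched with two statements. This is only an illustration using Python's built-in sqlite3 module: the orders table, its columns, and the sample rows are assumptions, and a single table plays both roles here, whereas in practice the analytical query would run against the warehouse rather than the operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    Order_ID   INTEGER PRIMARY KEY,
    Order_Year INTEGER,
    Status     TEXT,
    Amount     REAL
);
INSERT INTO orders VALUES
    (1, 2022, 'shipped', 100.0),
    (2, 2022, 'shipped', 50.0),
    (3, 2023, 'pending', 75.0);
""")

# OLTP-style work: a short transaction that touches one row by its key.
with conn:
    conn.execute("UPDATE orders SET Status = 'shipped' WHERE Order_ID = ?", (3,))

# OLAP-style work: a read-only aggregate over the whole history.
yearly_totals = conn.execute("""
    SELECT Order_Year, SUM(Amount)
    FROM orders
    GROUP BY Order_Year
    ORDER BY Order_Year
""").fetchall()

print(yearly_totals)
```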
Multidimensional Data Model
• The multidimensional data model is a type of data model
used primarily in data warehousing that organizes data into
multiple dimensions, each representing a specific attribute
of the data.
• It typically uses a cube structure to organize this data and
supports high-performance querying for analytical reports.
• This structure helps simplify the analysis of large and
complex data sets.
• In a data warehouse, the multidimensional data model represents data
in the form of data cubes.
• It allows data to be modeled and viewed in multiple dimensions, and it
is defined by dimensions and facts.
• A multidimensional data model is generally organized around a
central theme, which is represented by a fact table.
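A data cube can be sketched in a few lines of plain Python. The sketch below treats each fact as a row of dimension values plus one measure, and "rolls up" the cube by summing the measure over the dimensions that are dropped. The dimension names (product, region, quarter), the roll_up helper, and the sample figures are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, units_sold).
# The first three values are dimensions; the last one is the measure.
facts = [
    ("Laptop", "North", "Q1", 10),
    ("Laptop", "South", "Q1", 4),
    ("Phone", "North", "Q1", 7),
    ("Phone", "North", "Q2", 9),
]

DIMENSIONS = ("product", "region", "quarter")


def roll_up(rows, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


by_product = roll_up(facts, ["product"])
by_region_quarter = roll_up(facts, ["region", "quarter"])

print(by_product)
print(by_region_quarter)
```

Each call to roll_up gives a different view of the same facts, which is the idea behind viewing data "in multiple dimensions".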
Multidimensional Schema
• Multidimensional Schema is especially designed to model
data warehouse systems.
• The schemas are designed to address the unique needs of very
large databases designed for the analytical purpose (OLAP).
• Types of Data Warehouse Schema:
• Following are 3 main types of multidimensional schemas each having
its unique advantages.
1. Star Schema
2. Snowflake Schema
3. Galaxy Schema
Star Schema
• A Star Schema in a data warehouse is a schema in which one fact
table sits at the center of the star, surrounded by a number of
associated dimension tables.
• It is known as star schema as its structure resembles a star.
• The Star Schema data model is the simplest type of Data
Warehouse schema. It is also known as the Star Join Schema
and is optimized for querying large data sets.
Example
In the above Star Schema example, the fact table is at the center
which contains keys to every dimension table like Dealer_ID, Model
ID, Date_ID, Product_ID, Branch_ID & other attributes like Units sold
and revenue.
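A minimal sketch of such a star schema, using Python's built-in sqlite3 module. Only the Dealer and Date dimensions are shown. The table names (dim_dealer, dim_date, fact_sales), the Dealer_Name, Country, Year and Quarter columns, and all sample rows are assumptions made for illustration and are not taken from the original figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, holding descriptive attributes.
CREATE TABLE dim_dealer (
    Dealer_ID   INTEGER PRIMARY KEY,
    Dealer_Name TEXT,
    Country     TEXT
);
CREATE TABLE dim_date (
    Date_ID INTEGER PRIMARY KEY,
    Year    INTEGER,
    Quarter TEXT
);
-- Fact table at the center: foreign keys to each dimension plus measures.
CREATE TABLE fact_sales (
    Dealer_ID  INTEGER REFERENCES dim_dealer(Dealer_ID),
    Date_ID    INTEGER REFERENCES dim_date(Date_ID),
    Units_Sold INTEGER,
    Revenue    REAL,
    PRIMARY KEY (Dealer_ID, Date_ID)
);
INSERT INTO dim_dealer VALUES (1, 'Alpha Motors', 'India'), (2, 'Beta Cars', 'Japan');
INSERT INTO dim_date VALUES (10, 2023, 'Q1'), (11, 2023, 'Q2');
INSERT INTO fact_sales VALUES (1, 10, 5, 500.0), (1, 11, 3, 300.0), (2, 10, 8, 800.0);
""")

# A typical star join: the fact table joins directly to each dimension.
star_rows = conn.execute("""
    SELECT d.Country, t.Quarter, SUM(f.Units_Sold), SUM(f.Revenue)
    FROM fact_sales AS f
    JOIN dim_dealer AS d ON d.Dealer_ID = f.Dealer_ID
    JOIN dim_date   AS t ON t.Date_ID   = f.Date_ID
    GROUP BY d.Country, t.Quarter
    ORDER BY d.Country, t.Quarter
""").fetchall()

for row in star_rows:
    print(row)
```

Every join in the query goes from the fact table to exactly one dimension table, which is what gives the schema its star shape.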
Fact Tables
• A Fact table in a star schema contains facts and is connected to
dimensions.
• A fact table has two types of columns:
• Columns that hold the facts (measures)
• Foreign keys to the dimension tables
• Generally, the primary key of a fact table is a composite key that is
made up of all the foreign keys that make up the table.
• Fact tables can contain detail-level facts or aggregated facts. Fact
tables that include aggregated facts are often called summary
tables. Within one fact table, the facts usually share the same
level of aggregation.
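The composite key and the idea of a summary table can be sketched as follows, again with Python's built-in sqlite3 module. The table names (fact_sales, summary_sales_by_product), the choice of Product_ID and Date_ID as the only foreign keys, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Detail-level fact table: the primary key is made up of the foreign keys.
CREATE TABLE fact_sales (
    Product_ID INTEGER,
    Date_ID    INTEGER,
    Units_Sold INTEGER,
    PRIMARY KEY (Product_ID, Date_ID)
);
INSERT INTO fact_sales VALUES (1, 20230101, 4), (1, 20230102, 6), (2, 20230101, 1);
""")

# A second row for the same (Product_ID, Date_ID) pair violates the composite key.
try:
    conn.execute("INSERT INTO fact_sales VALUES (1, 20230101, 99)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Summary table: the same facts aggregated to the product level.
conn.execute("""
    CREATE TABLE summary_sales_by_product AS
    SELECT Product_ID, SUM(Units_Sold) AS Total_Units
    FROM fact_sales
    GROUP BY Product_ID
""")
summary = conn.execute("""
    SELECT Product_ID, Total_Units
    FROM summary_sales_by_product
    ORDER BY Product_ID
""").fetchall()

print(duplicate_rejected)
print(summary)
```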
Dimension Tables
• A dimension is a structure that categorizes data, usually in a hierarchy.
• A dimension without hierarchies and levels is called a flat dimension
or list.
• Each dimension table’s primary key is part of the composite primary
key of the fact table.
• A dimension attribute is a descriptive, textual attribute that helps
describe a dimensional value.
• Fact tables are usually larger than dimension tables.
Characteristics of Star Schema
• Every dimension in a star schema is represented by exactly one
dimension table.
• Each dimension table contains a set of descriptive attributes.
• Each dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The Star schema is easy to understand and provides optimal disk usage.
• The dimension tables are not normalized. For instance, in the above figure,
Country_ID does not have Country lookup table as an OLTP design would
have.
• The schema is widely supported by BI Tools
Advantages of Star Schema
• Star schemas use simpler join logic than is needed to fetch data from
highly normalized transactional schemas.
• As opposed to highly normalized transactional schemas, the star schema
simplifies common business reporting logic, such as period-over-period
reporting.
• Star schemas are widely used by OLAP systems to design cubes
efficiently. A star schema can be used as a source without designing a
cube structure in most major OLAP systems.
• Query processors can apply performance optimizations that are specific
to star schemas, which can produce better execution plans.
Disadvantage of Star Schema
• Since the schema is highly de-normalized, data integrity is not
enforced well.
• Not flexible in terms of analytical needs.
• Star schemas do not easily support many-to-many relationships between
business entities.
Snowflake Schema
• Snowflake Schema in data warehouse is a logical arrangement of
tables in a multidimensional database such that the ER
diagram resembles a snowflake shape.
• A Snowflake Schema is an extension of a Star Schema, and it adds
additional dimensions.
• The dimension tables are normalized which splits data into
additional tables.
Example
In the following Snowflake Schema example, Country is further normalized into an
individual table.
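The normalization of Country can be sketched as follows with Python's built-in sqlite3 module. The table names (dim_country, dim_dealer, fact_sales), the Country_Name and Dealer_Name columns, and the sample rows are assumptions for illustration and are not taken from the original figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Country is split out of the dealer dimension into its own lookup table.
CREATE TABLE dim_country (
    Country_ID   INTEGER PRIMARY KEY,
    Country_Name TEXT
);
CREATE TABLE dim_dealer (
    Dealer_ID   INTEGER PRIMARY KEY,
    Dealer_Name TEXT,
    Country_ID  INTEGER REFERENCES dim_country(Country_ID)
);
CREATE TABLE fact_sales (
    Dealer_ID  INTEGER REFERENCES dim_dealer(Dealer_ID),
    Units_Sold INTEGER
);
INSERT INTO dim_country VALUES (1, 'India'), (2, 'Japan');
INSERT INTO dim_dealer VALUES (1, 'Alpha Motors', 1), (2, 'Beta Cars', 2), (3, 'Gamma Auto', 1);
INSERT INTO fact_sales VALUES (1, 5), (2, 8), (3, 2);
""")

# Reaching the country name now takes two joins instead of one.
snowflake_rows = conn.execute("""
    SELECT c.Country_Name, SUM(f.Units_Sold)
    FROM fact_sales  AS f
    JOIN dim_dealer  AS d ON d.Dealer_ID  = f.Dealer_ID
    JOIN dim_country AS c ON c.Country_ID = d.Country_ID
    GROUP BY c.Country_Name
    ORDER BY c.Country_Name
""").fetchall()

print(snowflake_rows)
```

Each country name is stored once in dim_country rather than repeated on every dealer row, at the price of one more join per query.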
Characteristics of Snowflake Schema
• The main benefit of the snowflake schema is that it uses less disk
space.
• It is easier to implement when a dimension is added to the schema.
• Because queries must join more tables, query performance is reduced.
• The primary challenge of the snowflake schema is the extra maintenance
effort caused by the larger number of lookup tables.
Advantage of Snowflake Schema
• Snowflake schema’s primary advantage is its ability to reduce disk
storage requirements. The lookup tables it joins are small, which can
help queries that touch only those tables.
• Provides greater scalability in the interrelationship between
components and dimension levels.
• Redundancy is reduced, so dimension data is easier to keep consistent.
Disadvantage of Snowflake Schema
• A significant disadvantage of the snowflake schema is the increased
maintenance required.
• Complex queries are challenging to understand.
• A larger number of tables means more joins, so a longer query
execution time.
Galaxy Schema
• A Galaxy Schema contains two or more fact tables that share dimension
tables between them.
• It is also called Fact Constellation Schema.
• The schema is viewed as a collection of stars hence the name Galaxy
Schema.
Example
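A minimal sketch of a galaxy schema, using Python's built-in sqlite3 module: two fact tables share one dimension table. The table names (dim_product, fact_sales, fact_shipments), their columns, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One shared dimension table.
CREATE TABLE dim_product (
    Product_ID   INTEGER PRIMARY KEY,
    Product_Name TEXT
);
-- Two fact tables, each the center of its own star.
CREATE TABLE fact_sales (
    Product_ID INTEGER REFERENCES dim_product(Product_ID),
    Units_Sold INTEGER
);
CREATE TABLE fact_shipments (
    Product_ID    INTEGER REFERENCES dim_product(Product_ID),
    Units_Shipped INTEGER
);
INSERT INTO dim_product VALUES (1, 'Sedan'), (2, 'Hatchback');
INSERT INTO fact_sales VALUES (1, 5), (1, 3), (2, 4);
INSERT INTO fact_shipments VALUES (1, 10), (2, 6), (2, 1);
""")

# Aggregate each fact table on its own, then combine through the shared dimension.
galaxy_rows = conn.execute("""
    SELECT p.Product_Name, s.Sold, sh.Shipped
    FROM dim_product AS p
    JOIN (SELECT Product_ID, SUM(Units_Sold) AS Sold
          FROM fact_sales GROUP BY Product_ID) AS s
      ON s.Product_ID = p.Product_ID
    JOIN (SELECT Product_ID, SUM(Units_Shipped) AS Shipped
          FROM fact_shipments GROUP BY Product_ID) AS sh
      ON sh.Product_ID = p.Product_ID
    ORDER BY p.Product_Name
""").fetchall()

print(galaxy_rows)
```

The two fact tables are aggregated separately before the join. Joining them row to row through the shared dimension would pair every sale with every shipment of the same product and inflate the totals.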