Unit - 4
UNIT - IV
DIMENSIONAL MODELING
AND
SCHEMA
UNIT-IV 4. 1
Paavai Institutions Department of CSE
CONTENTS
4.1 DIMENSIONAL MODELING
4.2 MULTI-DIMENSIONAL DATA MODELING
4.3 DATA CUBE
4.4 STAR SCHEMA
4.5 SNOWFLAKE SCHEMA
4.6 STAR VS SNOWFLAKE SCHEMA
4.7 FACT CONSTELLATION SCHEMA
4.8 SCHEMA DEFINITION
4.8.1 STAR SCHEMA
4.8.2 SNOWFLAKE SCHEMA
4.8.3 GALAXY SCHEMA
4.9 PROCESS ARCHITECTURE
QUESTION BANK
TECHNICAL TERMS
6. Fact Constellation
   Measure of online analytical processing.
   A logical structure of a data warehouse or data mart.
   Source: http://www.yourdictionary.com/
8. Data Warehouse
   Holds highly structured data.
   A central repository of information that can be analyzed to make more informed decisions.
   Source: http://www.yourdictionary.com/
4.1 DIMENSIONAL MODELING
Dimensional modeling represents data as a cube, which gives a logical data representation that is well
suited to OLAP data management. The concept of dimensional modeling was developed by Ralph Kimball and
consists of "fact" and "dimension" tables.
Dimensional modeling has two goals:
1. To produce a database architecture that is easy for end-clients to understand and write queries against.
2. To maximize the efficiency of queries. It achieves these goals by minimizing the number of tables and
relationships between them.
Fact
It is a collection of associated data items, consisting of measures and context data. It typically represents
business items or business transactions.
Dimensions
It is a collection of data that describes one business dimension. Dimensions provide the contextual
background for the facts, and they are the framework over which OLAP is performed.
Measure
It is a numeric attribute of a fact, representing the performance or behavior of the business relative to the
dimensions.
Considering the relational context, there are two basic models which are used in dimensional modeling:
o Star Model
o Snowflake Model
The star model is the underlying structure for a dimensional model. It has one large central table
(fact table) and a set of smaller tables (dimensions) arranged in a radial design around the central
table.
The snowflake model is the result of decomposing one or more of the dimensions.
Fact Table
Fact tables are used to store facts or measures of the business. Facts are the numeric data elements that
are of interest to the company.
The fact table includes numerical values of what we measure. For example, a fact value of 20 might
mean that 20 widgets have been sold.
Each fact table includes the keys to associated dimension tables. These are known as foreign keys in
the fact table.
When it is compared to dimension tables, fact tables have a large number of rows.
Dimension Table
Dimension tables establish the context of the facts. Dimensional tables store fields that describe
the facts.
Dimension tables contain the details about the facts. This, for example, enables business analysts
to understand the data and their reports better.
The dimension tables include descriptive data about the numerical values in the fact table. That is, they
contain the attributes of the facts. For example, the dimension tables for a marketing analysis function
might include attributes such as time, marketing region, and product type.
Since the record in a dimension table is denormalized, it usually has a large number of columns. The
dimension tables include significantly fewer rows of information than the fact table.
The attributes in a dimension table are used as row and column headings in a document or query
results display.
Example: A store summary in a fact table can be viewed by city and state. An item summary can be viewed
by brand, color, etc. Customer information can be viewed by name and address.
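As a rough illustration of how a fact table and its dimension tables fit together, the following Python sketch builds a tiny star schema in an in-memory SQLite database. The table names (dim_store, dim_item, fact_sales), their columns, and the sample rows are assumptions made up for this example; they are not part of the original notes.

```python
import sqlite3

# In-memory database; all table names, columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT, state TEXT);
    CREATE TABLE dim_item  (item_key  INTEGER PRIMARY KEY, brand TEXT, color TEXT);
    -- The fact table holds the numeric measure plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        store_key  INTEGER REFERENCES dim_store(store_key),
        item_key   INTEGER REFERENCES dim_item(item_key),
        units_sold INTEGER
    );
    INSERT INTO dim_store VALUES (1, 'Chennai', 'Tamil Nadu'), (2, 'Mumbai', 'Maharashtra');
    INSERT INTO dim_item  VALUES (10, 'Acme', 'red'), (11, 'Acme', 'blue');
    INSERT INTO fact_sales VALUES (1, 10, 20), (1, 11, 5), (2, 10, 7);
""")

# Dimension attributes (city, brand) act as the headings of the report;
# the measure (units_sold) is aggregated for each combination.
summary = conn.execute("""
    SELECT s.city, i.brand, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_store s ON s.store_key = f.store_key
    JOIN dim_item  i ON i.item_key  = f.item_key
    GROUP BY s.city, i.brand
    ORDER BY s.city
""").fetchall()
print(summary)
```

Note that the fact table has the most rows while each dimension table stays small, which matches the row-count contrast described above.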
Dimensional modeling is simple: Dimensional modeling methods make it possible for warehouse
designers to create database schemas that business customers can easily grasp and comprehend.
Dimensional modeling promotes data quality: The star schema enables warehouse administrators to
enforce referential integrity checks on the data warehouse.
By enforcing foreign key constraints as a form of referential integrity check, data warehouse DBAs
add a line of defense against corrupted warehouse data.
Performance optimization is possible through aggregates: As the size of the data warehouse increases,
performance optimization becomes a pressing concern.
It is difficult to modify the data warehouse operation if the organization adopting the dimensional
technique changes the method in which it does business.
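The referential integrity check mentioned above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the tables dim_product and fact_sales and their rows are hypothetical, and other database systems enable and report foreign key checks in their own ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE fact_sales (
        product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
        quantity    INTEGER
    )
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'widget')")
conn.execute("INSERT INTO fact_sales VALUES (1, 20)")  # valid: product 1 exists

try:
    # Product 99 does not exist in the dimension table, so this row is an orphan.
    conn.execute("INSERT INTO fact_sales VALUES (99, 3)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

fact_rows = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print("orphan fact row rejected:", rejected, "| fact rows stored:", fact_rows)
```

The constraint stops a fact row from referring to a dimension member that does not exist, which is the "line of defense" the notes describe.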
4.2 MULTI-DIMENSIONAL DATA MODELING
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records. For
example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions
time, item, and location. Each dimension has a table related to it, called a dimension table, which describes
the dimension further. For example, a dimension table for item may contain the attributes item name, brand,
and type.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In
this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the
item dimension (classified according to the types of item sold). The fact or measure displayed is rupees sold
(in thousands).
Now suppose we want to view the sales data with a third dimension. For example, suppose the data according to
time and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These
3D data are shown in the table. The 3D data of the table are represented as a series of 2D tables.
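The idea of storing 3D data as a series of 2D tables can be sketched in plain Python. The quarter and item labels and every number below are invented for illustration (they are not the values from the notes' table), and only two cities are shown to keep the sketch short.

```python
# Hypothetical figures (rupees sold, in thousands); not taken from the notes' table.
quarters = ["Q1", "Q2"]
items = ["Egg", "Milk"]

# One 2D table (time x item) per city: the "series of 2D tables" view of a 3D cube.
cube = {
    "Delhi":   [[10, 20], [30, 40]],
    "Chennai": [[1, 2], [3, 4]],
}

def cell(city, quarter, item):
    """Look up one measure value by its three dimension coordinates."""
    return cube[city][quarters.index(quarter)][items.index(item)]

# Print each city's 2D table in turn.
for city, table in cube.items():
    print(city)
    for quarter, row in zip(quarters, table):
        print(" ", quarter, dict(zip(items, row)))
```

Each cell is addressed by one value from every dimension, for example cell("Delhi", "Q2", "Egg").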
Measures: Measures are numerical data that can be analyzed and compared, such as sales or
revenue. They are typically stored in fact tables in a multidimensional data model.
Dimensions: Dimensions are attributes that describe the measures, such as time, location, or
product. They are typically stored in dimension tables in a multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships between measures and
dimensions in a data model. They provide a fast and efficient way to retrieve and analyze data.
Aggregation: Aggregation is the process of summarizing data across dimensions and levels of
detail. This is a key feature of multidimensional data models, as it enables users to quickly analyze
data at different levels of granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary of data
to a lower level of detail, while roll-up is the opposite process of moving from a lower-level detail to
a higher-level summary. These features enable users to explore data in greater detail and gain
insights into the underlying patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For example, a
time dimension might be organized into years, quarters, months, and days. Hierarchies provide a
way to navigate the data and perform drill-down and roll-up operations.
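Aggregation along a hierarchy, and the roll-up and drill-down operations built on it, can be sketched as follows. The year > quarter > month hierarchy follows the example above (days are left out for brevity); the fact rows and the helper name aggregate are assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical monthly sales facts: (year, quarter, month, amount).
facts = [
    (2023, "Q1", "Jan", 100), (2023, "Q1", "Feb", 150),
    (2023, "Q2", "Apr", 200), (2024, "Q1", "Jan", 50),
]

# Each hierarchy level is identified by how many leading columns form its key.
LEVELS = {"year": 1, "quarter": 2, "month": 3}

def aggregate(rows, level):
    """Sum the measure at the given level of the year > quarter > month hierarchy."""
    depth = LEVELS[level]
    totals = defaultdict(int)
    for row in rows:
        totals[row[:depth]] += row[-1]
    return dict(totals)

by_quarter = aggregate(facts, "quarter")
by_year = aggregate(facts, "year")     # roll-up: quarter -> year
by_month = aggregate(facts, "month")   # drill-down: quarter -> month
print(by_year)
print(by_quarter)
```

Rolling up shortens the grouping key (fewer, coarser groups); drilling down lengthens it (more, finer groups).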
OLAP (Online Analytical Processing): OLAP is a category of data processing built on the multidimensional
data model that supports fast and efficient querying of large datasets. OLAP systems are designed to handle
complex queries and provide fast response times.
The multidimensional data model is somewhat complicated in nature, and it requires trained professionals to
recognize and examine the data in the database.
While a multidimensional data model is in use, caching by the system has a great effect on the working of
the system.
Because it is complicated in nature, the databases are generally dynamic in design.
The path to achieving the end product is complicated most of the time.
Because the multidimensional data model involves complicated systems with a large number of databases,
the system is very vulnerable when a security breach occurs.
4.3 DATA CUBE
Multidimensional data cube: It stores large amounts of data by making use of a multidimensional array.
It increases its efficiency by keeping an index of each dimension. Thus, a multidimensional data cube
is able to retrieve data fast.
Relational data cube: It stores large amounts of data by making use of relational tables. Each
relational table represents a dimension of the data cube. It is slower compared to a
multidimensional data cube.
Data cube operations are used to manipulate data to meet the needs of users. These operations help
to select particular data for the analysis purpose.
Roll-up: this operation aggregates similar data attributes having the same dimension together.
Drill-down: this operation is the reverse of the roll-up operation. It allows us to take particular
information and then subdivide it further for finer-granularity analysis.
Slicing: this operation filters out the unnecessary portions. Suppose that, in a particular dimension, the
user doesn't need everything for analysis but only a particular attribute; slicing selects just that attribute.
Dicing: this operation does a multidimensional cut: it not only cuts one dimension but can also
go to another dimension and cut a certain range of it.
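Slicing and dicing as described above can be sketched on a small cube held in a Python dictionary. The cell values, the dimension names, and the helper functions slice_cube and dice_cube are all hypothetical and serve only to show the difference between the two operations.

```python
# Hypothetical cube cells keyed by (city, quarter, item) -> units sold.
cube = {
    ("Delhi", "Q1", "Egg"): 5,   ("Delhi", "Q2", "Egg"): 7,
    ("Delhi", "Q1", "Milk"): 3,  ("Mumbai", "Q1", "Egg"): 4,
    ("Mumbai", "Q2", "Milk"): 6, ("Kolkata", "Q3", "Milk"): 9,
}
DIMS = ("city", "quarter", "item")

def slice_cube(cells, dim, value):
    """Slice: fix ONE dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice_cube(cells, **allowed):
    """Dice: keep cells whose coordinates fall in the given value sets on several dimensions."""
    return {
        k: v for k, v in cells.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

q1_only = slice_cube(cube, "quarter", "Q1")
sub_cube = dice_cube(cube, city={"Delhi", "Mumbai"}, quarter={"Q1", "Q2"})
print(len(q1_only), "cells in the slice;", len(sub_cube), "cells in the dice")
```

A slice constrains a single dimension, while a dice constrains two or more dimensions at once and returns a smaller sub-cube.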
Multi-dimensional analysis: Data cubes enable multi-dimensional analysis of business data, allowing
users to view data from different perspectives and levels of detail.
Interactivity: Data cubes provide interactive access to large amounts of data, allowing users to easily
navigate and manipulate the data to support their analysis.
Speed and efficiency: Data cubes are optimized for OLAP analysis, enabling fast and efficient querying
and aggregation of data.
Data aggregation: Data cubes support complex calculations and data aggregation, enabling users to
quickly and easily summarize large amounts of data.
Helps in giving a summarised view of data.
Data cubes store large data in a simple way.
Data cube operations provide quick and better analysis.
Improves query performance.
Complexity: OLAP systems can be complex to set up and maintain, requiring specialized technical
expertise.
Data size limitations: OLAP systems can struggle with very large data sets and may require extensive
data aggregation or summarization.
Performance issues: OLAP systems can be slow when dealing with large amounts of data, especially
when running complex queries or calculations.
Cost: OLAP technology can be expensive, especially for enterprise-level solutions, due to the need for
specialized hardware and software.
4.4 STAR SCHEMA
A star schema is a type of data modeling technique used in data warehousing to represent data in a
structured and intuitive way. In a star schema, data is organized into a central fact table that contains the
measures of interest, surrounded by dimension tables that describe the attributes of the measures.
The fact table in a star schema contains the measures or metrics that are of interest to the user or
organization. For example, in a sales data warehouse, the fact table might contain sales revenue, units sold,
and profit margins.
Each record in the fact table represents a specific event or transaction, such as a sale or order.
The dimension tables in a star schema contain the descriptive attributes of the measures in the fact
table. These attributes are used to slice and dice the data in the fact table, allowing users to analyze the data
from different perspectives.
For example, in a sales data warehouse, the dimension tables might include product, customer, time,
and location.
In the above demonstration, SALES is a fact table having the attributes (Product ID, Order ID,
Customer ID, Employee ID, Total, Quantity, Discount), which reference the dimension tables.
The Employee dimension table contains the attributes: Emp ID, Emp Name, Title, Department and
Region. The Product dimension table contains the attributes: Product ID, Product Name, Product Category,
Unit Price.
The Customer dimension table contains the attributes: Customer ID, Customer Name, Address, City,
Zip. The Time dimension table contains the attributes: Order ID, Order Date, Year, Quarter, Month.
Advantages of Star Schema :
1. Simpler Queries –
The join logic of a star schema is quite simple compared to the join logic needed to fetch data
from a transactional schema that is highly normalized.
2. Simplified Business Reporting Logic –
In comparison to a transactional schema that is highly normalized, the star schema simplifies
common business reporting logic, such as as-of reporting and period-over-period reporting.
3. Feeding Cubes –
The star schema is widely used by OLAP systems to design OLAP cubes efficiently. In fact, major OLAP
systems provide a ROLAP mode of operation that can use a star schema as a source without designing
a cube structure.
4.5 SNOWFLAKE SCHEMA
For example, in a sales data warehouse, the product dimension table might be normalized into
multiple related tables, such as product category, product subcategory, and product details. Each of these
tables would be related to the product dimension table through a foreign key relationship.
Example:
The Employee dimension table now contains the attributes: EmployeeID, EmployeeName,
DepartmentID, Region, and Territory. The DepartmentID attribute links the Employee table with
the Department dimension table. The Department dimension is used to provide detail about each
department, such as the Name and Location of the department. The Customer dimension table now
contains the attributes: CustomerID, CustomerName, Address, and CityID. The CityID attribute links
the Customer dimension table with the City dimension table. The City dimension table has details about
each city such as city name, Zipcode, State, and Country.
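The effect of normalizing the Customer dimension into Customer and City can be sketched with an in-memory SQLite database. The table and key names follow the description above, but the column subset, the Sales fact table layout, and every row are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE City (CityID INTEGER PRIMARY KEY, CityName TEXT, State TEXT);
    -- Customer no longer stores the city name; it points to the City sub-dimension.
    CREATE TABLE Customer (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT,
        CityID       INTEGER REFERENCES City(CityID)
    );
    CREATE TABLE Sales (CustomerID INTEGER REFERENCES Customer(CustomerID), Total REAL);
    INSERT INTO City VALUES (1, 'Chennai', 'Tamil Nadu'), (2, 'Delhi', 'Delhi');
    INSERT INTO Customer VALUES (100, 'Asha', 1), (101, 'Ravi', 1), (102, 'Meena', 2);
    INSERT INTO Sales VALUES (100, 250.0), (101, 100.0), (102, 75.0);
""")

# Reaching the city name takes two joins: Sales -> Customer -> City.
totals_by_city = conn.execute("""
    SELECT c.CityName, SUM(s.Total)
    FROM Sales s
    JOIN Customer cu ON cu.CustomerID = s.CustomerID
    JOIN City c      ON c.CityID = cu.CityID
    GROUP BY c.CityName
    ORDER BY c.CityName
""").fetchall()
print(totals_by_city)
```

Each city name is stored once in City instead of being repeated on every customer row, at the cost of an extra join in the query.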
4.6 STAR VS SNOWFLAKE SCHEMA
Star Schema: The star schema is a type of multidimensional model used for data
warehouses. A star schema contains fact tables and dimension tables.
This schema uses fewer foreign-key joins. It forms a star with the fact table at the center and the
dimension tables around it.
Snowflake Schema: A snowflake schema contains fact tables, dimension tables, and sub-dimension
tables. This schema forms a snowflake with the fact tables, dimension tables, and sub-dimension
tables.
S.No | Star Schema                                | Snowflake Schema
1.   | Contains fact tables and dimension tables.  | Contains fact tables, dimension tables, and sub-dimension tables.
4.   | Takes less time to execute queries.         | Takes more time than the star schema to execute queries.
5.   | Normalization is not used.                  | Both normalization and denormalization are used.
7.   | Query complexity is low.                    | Query complexity is higher than in the star schema.
9.   | Has fewer foreign keys.                     | Has more foreign keys.
10.  | Has high data redundancy.                   | Has low data redundancy.
4.7 FACT CONSTELLATION SCHEMA
A fact constellation means two or more fact tables sharing one or more dimensions. It is also
called a galaxy schema. A fact constellation schema describes a logical structure of a data warehouse or
data mart. It can be designed with a collection of de-normalized fact tables and shared, conformed
dimension tables.
A fact constellation schema can be implemented between aggregate fact tables or by decomposing a complex
fact table into independent simple fact tables.
A fact constellation schema has multiple fact tables. It is a widely used schema and is more complex than
star schemas and snowflake schemas. It is possible to create a fact constellation schema by splitting the
original star schema into more star schemas. It has many fact tables and some common dimension tables.
This schema defines two fact tables, sales, and shipping. Sales are treated along four dimensions,
namely, time, item, branch, and location. The schema contains a fact table for sales that includes keys to each
of the four dimensions, along with two measures: Rupee_sold and units_sold.
The shipping table has five dimensions, or keys: item_key, time_key, shipper_key, from_location, and
to_location, and two measures: Rupee_cost and units_shipped.
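The sales and shipping example above can be written down as a small Python structure to show which dimensions the two fact tables share. Treating from_location and to_location as two roles of a single location dimension is an assumption of this sketch; the dictionary layout is likewise only illustrative.

```python
# Dimensions and measures of each fact table, following the sales/shipping example.
# Assumption: from_location and to_location both refer to the location dimension.
fact_tables = {
    "sales": {
        "dimensions": {"time", "item", "branch", "location"},
        "measures": ["Rupee_sold", "units_sold"],
    },
    "shipping": {
        "dimensions": {"time", "item", "shipper", "location"},
        "measures": ["Rupee_cost", "units_shipped"],
    },
}

# Dimensions referenced by both fact tables are the shared ones.
shared = fact_tables["sales"]["dimensions"] & fact_tables["shipping"]["dimensions"]
# Dimensions used by only one fact table.
sales_only = fact_tables["sales"]["dimensions"] - shared
shipping_only = fact_tables["shipping"]["dimensions"] - shared
print("shared:", sorted(shared))
print("sales only:", sorted(sales_only), "| shipping only:", sorted(shipping_only))
```

The shared dimension tables are what tie the separate star schemas into one constellation.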
The primary disadvantage of the fact constellation schema is that it is a more challenging design
because many variants for specific kinds of aggregation must be considered and selected.
4.8 SCHEMA DEFINITION
Schema in general terms means "logical structure". So, a data warehouse schema describes the logical
structure of a data warehouse containing records.
The concept behind the schema of a data warehouse is the same as that in databases. Databases use
relational data models for their logical structure, while data warehouses use schemas for the
same purpose.
The schemas in data warehouses are used to understand the complexity of the structure of a data
warehouse.
They are basically the representation of the outer model, or the way to logically deduce the results from
the figure, and these figures are made from combinations of fact tables and dimension tables.
The conceptual modelling of a warehouse comprises three models: star schema, snowflake schema, and
galaxy schema.
A fact table contains keys to the dimension tables, which are referenced using foreign keys.
A dimension table consists of the keys that the facts in the fact table refer to, together with their
corresponding attributes.
4.8.1 STAR SCHEMA
The star schema is the simplest structural description of a data warehouse and is less complex than the
other dimensional models, i.e., the snowflake and galaxy schemas.
The star schema contains a single fact table that is connected to multiple dimension tables in the
shape of a star.
Because of its basic structure, the star schema can handle only simple data mining queries.
Each dimension table contains information about the attributes referenced in the fact table.
4.8.2 SNOWFLAKE SCHEMA
The snowflake schema describes the logical structure in much more detail as compared to the star schema.
The snowflake schema is more complex than the star schema but less complex than the galaxy (fact
constellation) schema.
The major difference between the snowflake and star schemas is that the star schema stores its dimension
data in tables which are not normalized, whereas in the snowflake schema the data is normalized.
This schema is called a snowflake because it portrays a shape similar to that of a snowflake, with the fact
table connected to multiple dimension tables that are drawn out from other dimension tables. This
results in more joins, which degrades query performance.
4.8.3 GALAXY SCHEMA
The galaxy schema is also known as a fact constellation. A fact constellation is a combination of fact
tables and dimension tables connected using joins.
Multiple star schemas are connected together to form a galaxy schema.
Advantages:
Highly flexible.
No data redundancy.
Low memory/space required.
Disadvantages:
Complicated design.
Creating, implementing, and maintaining a galaxy schema is a tough job.
More complex queries are required because of the higher number of joins used to connect fact and
dimension tables.
Data analysis is difficult because of the complex structure.
4.9 PROCESS ARCHITECTURE
The process architecture defines an architecture in which the data from the data warehouse is
processed for a particular computation.
Centralized Process Architecture
In this architecture, the data is collected into a single centralized storage and, upon completion, processed
by a single machine with large capacity in terms of memory, processor, and storage.
Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service. It requires minimal resources from both people and system
perspectives. It is very successful when the collection and consumption of data occur at the same location.
Distributed Process Architecture
In this architecture, information and its processing are distributed across data centers. Data is processed
locally, and the results are grouped into centralized storage. Distributed architectures are used to overcome
the limitations of the centralized process architectures, where all the information needs to be collected in
one central location and results are available in one central location.
Client-Server
In this architecture, the client does all the information collecting and presentation, while the server does
the processing and management of data.
Three-tier Architecture
With client-server architecture, the client machines need to be connected to a server machine, thus mandating
finite states and introducing latencies and overhead in terms of the records to be carried between clients and
servers.
N-tier Architecture
The n-tier or multi-tier architecture is where clients, middleware, applications, and servers are isolated into
tiers.
Cluster Architecture
In this architecture, machines are connected in a network architecture (software or hardware) to
work together to process information or compute requirements in parallel. Each device in a
cluster is associated with a function that is processed locally, and the result sets are collected on a master
server that returns them to the user.
Peer-to-Peer Architecture
This is a type of architecture where there are no dedicated servers and clients. Instead, all the processing
responsibilities are allocated among all machines, called peers. Each machine can perform the function of a
client or server or just process data.
Parallelism is used to support speedup, where queries are executed faster because more resources, such
as processors and disks, are provided. Parallelism is also used to provide scale-up, where increasing
workloads are managed without increased response time, via an increase in the degree of parallelism.
Different architectures for parallel database systems are shared-memory, shared-disk, shared-nothing, and
hierarchical structures.
(a) Horizontal Parallelism: It means that the database is partitioned across multiple disks, and parallel
processing occurs within a specific task (e.g., table scan) that is performed concurrently on different
processors against different sets of data.
(b) Vertical Parallelism: It occurs among various tasks. All component query operations (e.g., scan, join, and
sort) are executed in parallel in a pipelined fashion. In other words, the output from one operation (e.g., scan)
is passed to the next operation (e.g., join) as soon as records become available.
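Horizontal parallelism, as described in (a) above, can be sketched with Python's standard concurrent.futures module. The table contents, the round-robin partitioning, and the filter threshold are all hypothetical. Threads are used here only to show the structure: in CPython they do not give true CPU parallelism, and a real parallel database would run each scan on its own processor and disk.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table of (order_id, amount) rows.
table = [(i, i * 10) for i in range(1, 101)]

def partition(rows, n):
    """Split rows round-robin into n partitions (standing in for n disks)."""
    return [rows[i::n] for i in range(n)]

def scan(rows):
    """The same scan task, run on one partition: sum the amounts above 500."""
    return sum(amount for _, amount in rows if amount > 500)

parts = partition(table, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    # One scan per partition, all submitted concurrently.
    partial_sums = list(pool.map(scan, parts))

# Combine the partial results into the final answer.
parallel_total = sum(partial_sums)
print(parallel_total)
```

The same task runs on every partition, and the final answer is obtained by combining the partial results, which should equal a single sequential scan of the whole table.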
Intraquery Parallelism
Intraquery parallelism defines the execution of a single query in parallel on multiple processors and
disks. Using intraquery parallelism is essential for speeding up long-running queries. Interquery parallelism
does not help in this function since each query is run sequentially.
Interquery Parallelism
In interquery parallelism, different queries or transactions execute in parallel with one another. This
form of parallelism can increase transaction throughput. The response times of individual transactions are not
faster than they would be if the transactions were run in isolation.
Thus, the primary use of interquery parallelism is to scale up a transaction processing system to
support a larger number of transactions per second.
Shared-disk architecture implements a concept of shared ownership of the entire database between
RDBMS servers, each of which is running on a node of a distributed memory system. Each RDBMS server
can read, write, update, and delete information from the same shared database, which requires the system
to implement a form of distributed lock manager (DLM). DLM components can be found in hardware, the
operating system, and a separate software layer, depending on the system vendor. On the positive side,
shared-disk architectures can reduce performance bottlenecks resulting from data skew (uneven distribution of
data), and can significantly increase system availability.
The shared-disk distributed memory design eliminates the memory access bottleneck typical of large SMP
systems and helps reduce DBMS dependency on data partitioning.
Shared-Memory Architecture
Shared-Nothing Architecture
In a shared-nothing distributed memory environment, the data is partitioned across all disks, and the
DBMS is "partitioned" across multiple co-servers, each of which resides on an individual node of the parallel
system and has ownership of its disk and thus its database partition. A shared-nothing RDBMS parallelizes
the execution of a SQL query across multiple processing nodes.
Each processor has its own memory and disk and communicates with other processors by exchanging
messages and data over the interconnection network.
This architecture is optimized specifically for MPP and cluster systems. Shared-nothing
architectures offer near-linear scalability. The number of processor nodes is limited only by the hardware
platform limitations (and budgetary constraints), and each node itself can be a powerful SMP system.
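The shared-nothing idea can be illustrated with a small single-process simulation. The number of nodes, the sample rows, and the use of the key modulo the node count as the partitioning rule are assumptions of this sketch; a real system would place each partition on a separate machine and exchange the partial results as messages over the interconnection network.

```python
from collections import Counter

NODES = 3  # hypothetical number of co-servers

# Hypothetical (customer_id, amount) rows.
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (1, 5), (2, 15)]

# Each node owns only the rows whose key maps to it; nothing is shared.
node_storage = [[] for _ in range(NODES)]
for customer_id, amount in rows:
    node_storage[customer_id % NODES].append((customer_id, amount))

def local_sum(partition):
    """Work done independently on one node: SUM(amount) GROUP BY customer_id."""
    totals = Counter()
    for customer_id, amount in partition:
        totals[customer_id] += amount
    return totals

# The coordinator merges the partial results it receives from each node.
result = Counter()
for partition in node_storage:
    result.update(local_sum(partition))
print(dict(result))
```

Because all rows for a given customer land on the same node, each node can finish its part of the query without reading any other node's data.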
The tools that accurately source data contents and formats from external data stores into the data
warehouse have to perform several essential tasks.
There are several selection criteria which should be considered while implementing a data warehouse:
1. The ability to identify the data in the data source environment that can be read by the tool is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in many installations.
4. The specification interface to indicate the information to be extracted and the conversion rules is essential.
5. The ability to read information from repository products or data dictionaries is desired.
6. The code generated by the tool should be completely maintainable.
7. Selective data extraction of both data items and records enables users to extract only the required data.
8. A field-level data examination for the transformation of data into information is needed.
A warehousing team will require different types of tools during a warehouse project. These software
products usually fall into one or more of the categories illustrated, as shown in the figure.
The warehouse team needs tools that can extract, transform, integrate, clean, and load information from a
source system into one or more data warehouse databases. Middleware and gateway products may be needed
for warehouses that extract records from a host-based source system.
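The extract, transform, clean, and load steps named above can be sketched with Python's standard library alone. The flat-file contents, the column names, and the sales table are invented for this example; a production tool would add far more validation, logging, and error handling.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file extract from a source system.
source = io.StringIO("id,city,amount\n1, delhi ,100\n2,MUMBAI,\n3,Delhi,250\n")

# Extract: read records from the flat file.
records = list(csv.DictReader(source))

# Transform / clean: normalise city names and drop rows with a missing amount.
clean = [
    (int(r["id"]), r["city"].strip().title(), float(r["amount"]))
    for r in records
    if r["amount"].strip()
]

# Load: write the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER, city TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
loaded = warehouse.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city"
).fetchall()
print(loaded)
```

Cleaning before loading means that inconsistent spellings such as " delhi " and "Delhi" end up as a single value in the warehouse.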
Warehouse Storage
Software products are also needed to store warehouse data and their accompanying metadata. Relational
database management systems are well suited to large and growing warehouses.
Different types of software are needed to access, retrieve, distribute, and present warehouse data to its end-
clients.
QUESTION BANK
PART – A
1. Define multi dimensional data model.
2. What is a data cube?
3. Define dimensions.