DMW Unit 1
● Overview,
● Definition,
● Data Warehousing Components,
● Building a Data Warehouse,
● Warehouse Database,
● Mapping the Data Warehouse to a Multiprocessor
Architecture,
● Difference between Database System and Data
Warehouse,
● Multi-Dimensional Data Model,
● Data Cubes,
● Stars,
● Snowflakes,
● Fact Constellations,
● Concepts.
Data Warehousing - Overview
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a
data warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of
data. This data helps analysts to take informed decisions in an organization.
Why a Data Warehouse is Separated from Operational
Databases
A data warehouse is kept separate from operational databases for the following reasons −
● Non-volatile − Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.
Note − A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.
As discussed before, a data warehouse helps business executives organize, analyze, and use their data for decision making. A data warehouse serves as a core part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields −
● Financial services
● Banking services
● Consumer goods
● Retail sectors
● Controlled manufacturing
Information processing, analytical processing, and data mining are the three types of data warehouse applications, discussed below −
● Information Processing − A data warehouse allows us to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.
● Analytical Processing − A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up, and pivoting.
● Data Mining − Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. The mining results can be presented using visualization tools.
● OLAP systems are used by knowledge workers such as executives, managers, and analysts, whereas OLTP systems are used by clerks, DBAs, or database professionals.
● An OLAP database ranges in size from 100 GB to 100 TB, whereas an OLTP database ranges from 100 MB to 100 GB.
Building a Data Warehouse
Building a data warehouse involves several steps, from initial planning to implementation
and maintenance. A data warehouse (DW) is designed to aggregate and organize large
amounts of data from different sources to facilitate analysis and decision-making. Here’s
an overview of the steps involved in building a data warehouse:
Data Modeling
● Identify Data Sources: List the different systems (e.g., CRM, ERP, transactional
databases, third-party APIs) that will feed data into the warehouse.
● Schema Design: Design the structure of the warehouse. There are two common
approaches:
○ Star Schema: Uses fact tables and dimension tables for simplicity.
○ Snowflake Schema: A more normalized version of the star schema with
additional tables.
● Data Marts: Consider creating smaller, subject-specific databases (data marts) that
feed into the central warehouse.
ETL (Extract, Transform, Load)
● Extract: Gather data from various data sources, such as databases, flat files, cloud services, or APIs.
● Transform: Cleanse, normalize, and aggregate the data. This includes handling
missing values, removing duplicates, applying business rules, and transforming the
data into the desired format.
● Load: Insert the cleaned data into the data warehouse. Loading can be done in two
ways:
○ Batch Loading: Scheduled, periodic loads (e.g., daily or weekly).
○ Real-Time Loading: Continuous updating as new data comes in.
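A minimal sketch of this ETL flow in Python, using pandas and SQLite, is shown below. The file name, column names, and cleansing rules are illustrative assumptions, not part of any particular product.

```python
import sqlite3
import pandas as pd

# Extract: read raw records from a (hypothetical) source flat file.
raw = pd.read_csv("daily_sales_export.csv")  # assumed columns: order_id, amount, region

# Transform: cleanse and standardize before loading.
raw = raw.drop_duplicates(subset="order_id")           # remove duplicate extracts
raw["amount"] = raw["amount"].fillna(0.0)              # default value for missing measures
raw["region"] = raw["region"].str.strip().str.upper()  # normalize a dimension attribute

# Load: batch-insert the cleaned rows into a warehouse table.
warehouse = sqlite3.connect("warehouse.db")
raw.to_sql("fact_sales", warehouse, if_exists="append", index=False)
warehouse.commit()
warehouse.close()
```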
Data Storage
● Choose the Storage Model: Decide on the type of data warehouse platform, either cloud-based (e.g., AWS Redshift, Google BigQuery, Snowflake) or on-premise (e.g., Microsoft SQL Server, Oracle).
● Design the Physical Database: Set up indexing, partitioning, and sharding to improve
performance for large datasets.
● Data Archiving: Plan for long-term storage of historical data and its retention
policies.
Business Intelligence and Reporting
● Select BI Tools: Choose tools for querying, reporting, and visualization (e.g., Tableau, Power BI, Looker).
● Data Aggregation and Cubes: Use OLAP (Online Analytical Processing) cubes for
multidimensional data analysis.
● Build Dashboards and Reports: Create user-friendly reports and visualizations to help
stakeholders understand the data.
Testing and Optimization
● Data Validation: Ensure that the data in the warehouse is accurate and matches the source data.
● Performance Testing: Test the system for performance issues such as query speed
and ETL job efficiency.
● Optimize Queries: Indexing, partitioning, and denormalization techniques can help
improve query performance.
Deployment and Monitoring
● Deploy the Data Warehouse: Move the data warehouse from development to production.
● Monitoring: Continuously monitor the performance, data loading processes, and
usage patterns.
● Scaling: Plan for future growth and the need to scale storage, compute, and query
capabilities.
Maintenance and Evolution
● Maintenance: Regularly update the ETL process, add new data sources, and ensure that reports meet changing business needs.
● Feedback Loop: Regularly gather feedback from users to improve the system.
● Adapt to New Business Needs: Update the warehouse structure as the business
evolves.
Key Considerations:
● Cloud vs. On-Premise: Cloud data warehouses offer scalability, but on-premise
solutions give more control.
● Big Data Compatibility: If dealing with extremely large datasets, choose a
warehouse architecture that can handle big data efficiently (e.g., Hadoop-based
storage).
● Cost Management: Track the cost of data storage and processing, especially if using
a cloud-based platform where costs can rise quickly with increased usage.
Components or Building Blocks of Data
Warehouse
Architecture is the proper arrangement of the elements. We build a data warehouse with software and hardware components. To suit the requirements of our organization, we arrange these building blocks in a particular way, and we may want to strengthen one part with extra tools and services. All of this depends on our circumstances.
The figure shows the essential elements of a typical warehouse. The Source Data component appears on the left. The Data Staging element serves as the next building block. In the middle sits the Data Storage component, which manages the data warehouse's data. This element not only stores and manages the data; it also keeps track of the data using the metadata repository. The Information Delivery component, shown on the right, consists of all the different ways of making the information from the data warehouse available to users.
Source Data Component
Source data coming into the data warehouse may be grouped into four broad categories:
Production Data: This type of data comes from the various operational systems of the enterprise. Based on the data requirements of the data warehouse, we choose segments of the data from the different operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets, reports, customer profiles, and sometimes even departmental databases. This is the internal data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external agencies.
We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This function has to deal with numerous data sources. We have to employ the appropriate technique for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even more significant challenges. We perform several individual tasks as part of data transformation.
First, we clean the data extracted from each source. Cleaning may involve correcting misspellings, providing default values for missing data elements, or eliminating duplicates when we bring in the same data from various source systems.
Data transformation also includes purging source data that is not useful and combining separate source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized.
3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete the structure and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage. The initial load moves high volumes of data and uses up a substantial amount of time.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, we keep the data about the logical data structures, the data about the records and addresses, the information about the indexes, and so on.
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, though not necessarily up to the minute, although developments in the data warehouse industry have made frequent, incremental data loads more achievable. Data marts are smaller than data warehouses and usually contain data for a single department or subject area. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for particular kinds of queries and reports.
correctly saved in the repositories. It monitors the movement of data into the staging area and from there into the data warehouse storage itself. It may require the use of distinctive data organization, access, and implementation methods based on multidimensional views.
A data warehouse is used for analysis and decision making, which requires an extensive database, including historical data, that an operational database does not typically maintain. The separation of an operational database from a data warehouse is based on the different structures and uses of data in these systems.
Because the two systems provide different functionalities and require different kinds of data, it is necessary to maintain them as separate databases.
Database vs. Data Warehouse
● Database: The tables and joins are complicated since they are normalized for the RDBMS; this reduces redundant data and saves storage space. Data Warehouse: The tables and joins are simple since they are de-normalized; this minimizes the response time for analytical queries.
● Database: Entity-relationship modeling techniques are used for RDBMS database design. Data Warehouse: Data modeling techniques are used for data warehouse design.
● Database: The database is the place where the data is taken as a base and managed to provide fast and efficient access. Data Warehouse: The data warehouse is the place where the application data is handled for analysis and reporting objectives.
Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.
Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and new fields and keys are added to reflect the users' needs for sorting, combining, and summarizing data.
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
Data Warehouse Architecture: Basic
Operational System
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file
in the system must have a different name.
Metadata
A set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
This area of the data warehouse saves all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized records are updated continuously as new information is loaded into the warehouse.
We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse) instead.
A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
A data warehouse staging area is a temporary location where records from source systems are copied.
We can do this by adding data marts. A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, and so on.
The figure illustrates an example in which purchasing, sales, and stocks are separated. In this example, a financial analyst wants to analyze historical data for purchases and sales, or mine historical information to make predictions about customer behavior.
Properties of Data Warehouse Architectures
The following architecture properties are necessary for a data warehouse system:
2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume that has to be managed and processed, and the number of user requirements that have to be met, progressively increase.
4. Security: Monitoring accesses is necessary because of the strategic data stored in the data warehouse.
Single-Tier Architecture
Single-tier architecture is rarely used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
The figure shows that the only layer physically available is the source layer. In this approach, the data warehouse is virtual. This means that the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are applied to operational data after the middleware interprets them. In this way, the queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture
for a data warehouse system, as shown in fig:
1. Source layer: A data warehouse system uses heterogeneous sources of data. The data is stored initially in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata and can extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
3. Data warehouse layer: Information is saved in one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Metadata repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.
What is Dimensional Modeling?
Dimensional modeling represents data with a cube operation, making the logical data representation more suitable for OLAP data management. The concept of dimensional modeling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.
In dimensional modeling, the transaction record is divided into either "facts," which are frequently numerical transaction data, or "dimensions," which are the reference information that gives context to the facts. For example, a sale transaction can be broken down into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order.
The goals of dimensional modeling are:
1. To produce a database architecture that is easy for end-clients to understand and write queries against.
2. To maximize the efficiency of queries. It achieves these goals by minimizing the number of tables and the relationships between them.
Dimensional modeling promotes data quality: The star schema enables warehouse administrators to enforce referential integrity checks on the data warehouse. Since the fact table's key is a concatenation of the keys of its associated dimensions, a fact record is loaded only if the corresponding dimension records are duly defined and also exist in the database.
By enforcing foreign key constraints as a form of referential integrity check, data warehouse DBAs add a line of defense against corrupted warehouse data.
Performance optimization is possible through aggregates: As the size of the data warehouse increases, performance optimization becomes a pressing concern. Customers who have to wait hours to get a response to a query will quickly become discouraged with the warehouse. Aggregates are one of the easiest methods by which query performance can be optimized.
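As a small illustration of the idea, the following Python/SQLite sketch precomputes a monthly aggregate from a toy, invented detail fact table, so that reports can query the summary instead of scanning the detail rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Toy detail-level fact table; in a real warehouse this has millions of rows.
    CREATE TABLE fact_sales (month_key INTEGER, product_key INTEGER,
                             units_sold INTEGER, rupees_sold REAL);
    INSERT INTO fact_sales VALUES (1, 10, 2, 500), (1, 10, 1, 250),
                                  (2, 10, 4, 1000), (2, 20, 3, 900);

    -- Precompute the aggregate once; reports then read this small table.
    CREATE TABLE sales_monthly_summary AS
    SELECT month_key, product_key,
           SUM(units_sold)  AS units_sold,
           SUM(rupees_sold) AS rupees_sold
    FROM fact_sales
    GROUP BY month_key, product_key;
""")

for row in con.execute("SELECT * FROM sales_monthly_summary"):
    print(row)  # e.g. (1, 10, 3, 750.0)
```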
The basic elements of dimensional modeling are:
1. Fact
It is a collection of associated data items, consisting of measures and context data. It typically represents business items or business transactions.
2. Dimension
It is a collection of data that describes one business dimension. Dimensions determine the contextual background for the facts, and they are the framework over which OLAP is performed.
3. Measure
It is a numeric attribute of a fact, representing the performance or behavior of the business relative to the dimensions.
Considering the relational context, there are two basic models used in dimensional modeling:
○ Star Model
○ Snowflake Model
The star model is the underlying structure for a dimensional model. It has one broad central table (the fact table) and a set of smaller tables (the dimensions) arranged in a radial design around the central table. The snowflake model is the result of decomposing one or more of the dimensions.
Fact Table
Fact tables are used to record the facts or measures of the business. Facts are the numeric data elements that are of interest to the company.
● The fact table includes numerical values of what we measure. For example, a fact value of 20 might mean that 20 widgets have been sold.
● Each fact table includes the keys to the associated dimension tables. These are known as foreign keys in the fact table.
● Fact tables typically include a small number of columns.
● Compared to dimension tables, fact tables have a large number of rows.
Dimension Table
Dimension tables establish the context of the facts; they store fields that describe the facts. Dimension tables contain the details about the facts. That, for example, enables business analysts to understand the data and their reports better.
The dimension tables include descriptive data about the numerical values in the fact table.
That is, they contain the attributes of the facts. For example, the dimension tables for a
marketing analysis function might include attributes such as time, marketing region, and
product type.
Since the record in a dimension table is denormalized, it usually has a large number of
columns. The dimension tables include significantly fewer rows of information than the
fact table.
The attributes in a dimension table are used as row and column headings in a document or
query results display.
Example: A store summary in a fact table can be viewed by city and state. An item summary can be viewed by brand, color, etc. Customer information can be viewed by name and address.
Fact Table

Time ID   Item ID   Customer ID   Units
4         17        2             1
8         21        3             2
8         4         1             1

In this example, the Customer ID column in the fact table is the foreign key that joins to the dimension table. By following the link, we can see that row 2 of the fact table records the fact that customer 3, Gaurav, bought two items on day 8.
Dimension Table

Customer ID   Name      Gender   (other attribute keys)
1             Rohan     Male     2, 3, 4
2             Sandeep   Male     3, 5, 1
3             Gaurav    Male     1, 7, 3
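A small pandas sketch of this foreign-key join, using the rows above (the column headers are inferred from the example):

```python
import pandas as pd

fact = pd.DataFrame({
    "time_id":     [4, 8, 8],
    "item_id":     [17, 21, 4],
    "customer_id": [2, 3, 1],  # foreign key into the customer dimension
    "units":       [1, 2, 1],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name":        ["Rohan", "Sandeep", "Gaurav"],
    "gender":      ["Male", "Male", "Male"],
})

# Following the foreign key gives each fact row its context; row 2 resolves to
# "customer 3, Gaurav, bought two items on day 8".
print(fact.merge(customers, on="customer_id"))
```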
Hierarchy
A hierarchy is a directed tree whose nodes are dimensional attributes and whose arcs model many-to-one associations between pairs of dimensional attributes. It contains a dimension, positioned at the tree's root, and all of the dimensional attributes that describe it.
What is Multi-Dimensional Data Model?
A multidimensional model views data in the form of a data-cube. A data cube enables data
to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities with respect to which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These dimensions allow the store to keep track of things such as the monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further. For example, a dimension table for an item may contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures. The fact table
contains the names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the types of items sold). The fact or measure displayed is rupees_sold (in thousands).
Now, suppose we want to view the sales data with a third dimension. For example, suppose the data according to time and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table. The 3D data of the table are represented as a series of 2D tables.
Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in the figure.
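One way to sketch this in Python is with a pandas pivot table: filtering to one city gives the 2D time-by-item view, and adding location as a further grouping key produces the series of 2D tables that make up the 3D cube. The figures below are invented.

```python
import pandas as pd

# Toy sales records: one row per (quarter, item type, city); rupees_sold invented.
sales = pd.DataFrame({
    "quarter":     ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":        ["Keyboard", "Mouse", "Keyboard", "Mouse", "Keyboard", "Mouse"],
    "city":        ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "rupees_sold": [605, 825, 680, 952, 540, 610],
})

# 2D view: sales for Delhi alone, time dimension vs. item dimension.
delhi = sales[sales["city"] == "Delhi"]
print(delhi.pivot_table(values="rupees_sold", index="item", columns="quarter"))

# 3D view: location added as a third dimension; each city slice is a 2D table.
print(sales.pivot_table(values="rupees_sold", index="item",
                        columns=["city", "quarter"], aggfunc="sum"))
```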
Data Cube
What is Data Cube?
When data is grouped or combined in multidimensional matrices, the results are called data cubes. The data cube method has a few alternative names and variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently queried.
For example, a relation with the schema sales(part, supplier, customer, sale_price) can be materialized into a set of eight views as shown in the figure, where psc indicates a view consisting of aggregate function values (such as total sales) computed by grouping the three attributes part, supplier, and customer; p indicates a view composed of the corresponding aggregate function values calculated by grouping part alone; and so on.
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
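A toy pandas sketch of this materialization: for sales(part, supplier, customer, sale_price), iterating over every subset of the three grouping attributes yields all eight views, from psc down to the grand total. The data values are invented.

```python
from itertools import combinations
import pandas as pd

# Toy instance of the relation sales(part, supplier, customer, sale_price).
sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2", "p2"],
    "supplier":   ["s1", "s2", "s1", "s2"],
    "customer":   ["c1", "c1", "c2", "c2"],
    "sale_price": [100, 150, 200, 250],
})

dims = ["part", "supplier", "customer"]
views = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):  # (), p, s, c, ps, pc, sc, psc = 8 views
        if group:
            views[group] = sales.groupby(list(group))["sale_price"].sum()
        else:
            views[group] = sales["sale_price"].sum()  # the "all" view: total sales

print(views[("part",)])  # view "p": total sales grouped by part alone
```

The eight views correspond to the lattice of cuboids discussed later: the empty grouping is the apex and (part, supplier, customer) is the base cuboid.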
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, branch, and location. These dimensions enable the store to keep track of things like the monthly sales of items, and the branches and locations at which the items were sold. Each dimension may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item_name, brand, and type.
The data cube method is an interesting technique with many applications. Data cubes can be sparse in many cases, because not every cell in each dimension may have corresponding data in the database.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear how to make the best use of the precomputed results stored in the data cube.
The model views data in the form of a data cube. OLAP tools are based on the multidimensional data model. Data cubes usually model n-dimensional data.
Dimensions are the entities that define a data cube. Facts are generally quantities, which are used for analyzing the relationships between dimensions.
Example: In the 2-D representation, we look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table. The 3-D data of the table are represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of a 3-D data cube, as shown in the figure.
Let us suppose that we would like to view our sales data with an additional fourth dimension, such as supplier.
In data warehousing, data cubes are n-dimensional. The cuboid that holds the lowest level of summarization is called the base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier dimensions.
The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In this example, this is the total sales, or dollars sold, summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.
Schema
A schema is a logical description of the entire database. It includes the names and descriptions of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained. A database uses the relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema.
What is Star Schema?
A star schema is a relational schema whose design represents a multidimensional data model. The star schema is the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of this schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables.
Fact Tables
A fact table is a table in a star schema that contains the facts and is connected to the dimensions. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of a fact table is generally a composite key made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table generally contains facts with the same level of aggregation.
Dimension Tables
A dimension is an architecture usually composed of one or more hierarchies that categorize data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional values. They are generally descriptive, textual values. Dimension tables are usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic regions (markets, cities), clients, products, times, and channels.
Advantages of Star Schema
Star schemas are easy for end-users and applications to understand and navigate. With a well-designed schema, users can instantly analyze large, multidimensional data sets.
Query Performance
Because a star schema database has a limited number of tables and clear join paths, queries run faster than they do against OLTP systems. Small single-table queries, frequently of a dimension table, are almost instantaneous. Large join queries that contain multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table. When two dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two tables. This design feature enforces accurate and consistent query results.
various tables, the impact of a load is reduced. Dimension tables can be populated once and occasionally refreshed. We can add new facts regularly and selectively by appending records to a fact table.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These joins are more meaningful to the end-user because they represent the fundamental relationships between parts of the underlying business. Users can also browse dimension table attributes before constructing a query.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension
tables connected to it for time, branch, item, and geographic locations.
The TIME table has a column for each day, month, quarter, and year. The ITEM table has columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns for branch_key, branch_name, and branch_type. The LOCATION table has columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the dimension tables TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four columns for item data, three columns for branch data, and four columns for location data. Thus, the size of the fact table is significantly reduced. When we need to change an item, we need only make a single change in the dimension table, instead of making many changes in the fact table.
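A minimal sqlite3 sketch of this star schema is given below; the column lists follow the example above, while the data types and key names are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day INTEGER,
                           month INTEGER, quarter INTEGER, year INTEGER);
    CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                           brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                           branch_type TEXT);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                           city TEXT, state TEXT, country TEXT);

    -- The central fact table holds only foreign keys plus the measures.
    CREATE TABLE sales (
        time_key     INTEGER REFERENCES time(time_key),
        item_key     INTEGER REFERENCES item(item_key),
        branch_key   INTEGER REFERENCES branch(branch_key),
        location_key INTEGER REFERENCES location(location_key),
        rupees_sold  REAL,
        units_sold   INTEGER
    );
""")
```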
We can create even more complex star schemas by normalizing a dimension table into
several tables. The normalized dimension table is called a Snowflake.
What is Snowflake Schema?
A snowflake schema is equivalent to the star schema: "A schema is known as a snowflake if one or more dimension tables do not connect directly to the fact table but must join through other dimension tables."
The snowflake schema is an expansion of the star schema in which each point of the star explodes into more points. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact table in the middle.
The snowflake schema consists of one fact table that is linked to many dimension tables, which can be linked to other dimension tables through many-to-one relationships. Tables in a snowflake schema are generally normalized to third normal form. Each dimension table represents exactly one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three levels. A snowflake schema can have any number of dimensions, and each dimension can have any number of levels.
Example: The figure shows a snowflake schema with a Sales fact table and Store, Location, Time, Product, Line, and Family dimension tables. The Market dimension has two dimension tables, with Store as the primary dimension table and Location as the outrigger dimension table. The Product dimension has three dimension tables, with Product as the primary dimension table and the Line and Family tables as the outrigger dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This requires more disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension by moving attributes with low cardinality into separate dimension tables that relate to the core dimension table through foreign keys. Snowflaking for the sole purpose of minimizing disk space is not recommended, because it can adversely impact query performance.
The figure shows a simple STAR schema for sales in a manufacturing company. The sales fact table includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the dimension tables.
The STAR schema for sales, as shown above, contains only five tables, whereas the normalized version extends to eleven tables. Notice that in the snowflake schema, the attributes with low cardinality in each original dimension table are removed to form separate tables. These new tables are connected back to the original dimension table through artificial keys.
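The following sqlite3 sketch shows the general shape of snowflaking such a dimension: the low-cardinality attributes move out of the denormalized star table into outrigger tables reached through foreign keys. All names and types here are assumed for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Star version: one denormalized product dimension.
    CREATE TABLE product_star (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT, line_name TEXT, family_name TEXT
    );

    -- Snowflake version: low-cardinality attributes split into outrigger
    -- tables, each reached through an artificial (surrogate) key.
    CREATE TABLE family  (family_key INTEGER PRIMARY KEY, family_name TEXT);
    CREATE TABLE line    (line_key INTEGER PRIMARY KEY, line_name TEXT,
                          family_key INTEGER REFERENCES family(family_key));
    CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT,
                          line_key INTEGER REFERENCES line(line_key));
""")
```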
A snowflake schema is designed for flexible querying across more complex dimensions and relationships. It is suitable for many-to-many and one-to-many relationships between dimension levels.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution times.
Let's see the differences between the star and snowflake schemas.
● Ease of use − Star: queries are less complex and simple to understand. Snowflake: queries are more complex and therefore less easy to understand.
● Query performance − Star: fewer foreign keys and hence shorter query execution times. Snowflake: more foreign keys and thus longer query execution times.
● Type of data warehouse − Star: good for data marts with simple relationships (one-to-one or one-to-many). Snowflake: good for a data warehouse core, to simplify complex relationships (many-to-many).
● Data warehouse system − Star: works best in any data warehouse or data mart. Snowflake: better for small data warehouses and data marts.
What is Fact Constellation Schema?
A fact constellation has two or more fact tables sharing one or more dimensions. It is also called a galaxy schema.
The fact constellation schema describes the logical structure of a data warehouse or data mart. A fact constellation schema can be designed with a collection of de-normalized fact tables and shared, conformed dimension tables.
This schema defines two fact tables, sales and shipping. Sales is treated along four dimensions: time, item, branch, and location. The schema contains a fact table for sales that includes keys to each of the four dimensions, along with two measures: rupees_sold and units_sold. The shipping table has five dimensions, or keys (item_key, time_key, shipper_key, from_location, and to_location), and two measures (rupee_cost and units_shipped).
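A minimal sqlite3 sketch of this constellation is shown below, with the dimension tables abbreviated to their keys; the names follow the description above, while the types are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Shared (conformed) dimensions, abbreviated to their keys here.
    CREATE TABLE time     (time_key INTEGER PRIMARY KEY);
    CREATE TABLE item     (item_key INTEGER PRIMARY KEY);
    CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY);
    CREATE TABLE shipper  (shipper_key INTEGER PRIMARY KEY);

    CREATE TABLE sales (
        time_key     INTEGER REFERENCES time(time_key),
        item_key     INTEGER REFERENCES item(item_key),
        branch_key   INTEGER REFERENCES branch(branch_key),
        location_key INTEGER REFERENCES location(location_key),
        rupees_sold  REAL, units_sold INTEGER
    );

    CREATE TABLE shipping (
        item_key      INTEGER REFERENCES item(item_key),      -- shared with sales
        time_key      INTEGER REFERENCES time(time_key),      -- shared with sales
        shipper_key   INTEGER REFERENCES shipper(shipper_key),
        from_location INTEGER REFERENCES location(location_key),
        to_location   INTEGER REFERENCES location(location_key),
        rupee_cost    REAL, units_shipped INTEGER
    );
""")
```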
The primary disadvantage of the fact constellation schema is that it is a more challenging design, because many variants for specific kinds of aggregation must be considered and selected.
Information Processing
It deals with querying, statistical analysis, and reporting via tables, charts, or graphs. Nowadays, information processing for a data warehouse focuses on constructing low-cost, web-based accessing tools, typically integrated with web browsers.
Analytical Processing
It supports various online analytical processing operations such as drill-down, roll-up, and pivoting. The historical data is processed in both summarized and detailed formats.
OLAP is implemented on data warehouses or data marts. The primary objective of OLAP is to support the ad-hoc querying needed by decision support systems (DSS). The multidimensional view of data is fundamental to OLAP applications. OLAP is an operational view, not a data structure or schema. The complex nature of OLAP applications requires a multidimensional view of the data.
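A toy pandas sketch of these operations, assuming an invented dataset in which city rolls up to region:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "city":    ["Delhi", "Chandigarh", "Chennai", "Bengaluru"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "rupees":  [100, 80, 120, 90],
})

# Roll-up: climb the location hierarchy from city to region.
print(sales.groupby(["region", "quarter"])["rupees"].sum())

# Drill-down: descend back to the more detailed city level.
print(sales.groupby(["region", "city", "quarter"])["rupees"].sum())

# Pivot: rotate the axes so that quarters become columns.
print(sales.pivot_table(values="rupees", index="region",
                        columns="quarter", aggfunc="sum"))
```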
Data Mining
It helps in the analysis of hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Data mining is the technique of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.
Data Warehousing - Concepts
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources to support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
There are decision support technologies that help utilize the data available in a data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather data, analyze it, and make decisions based on the information present in the warehouse. To integrate heterogeneous databases, we have the following two approaches −
● Query-driven Approach
● Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrating heterogeneous databases. This approach is used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.
When a query is issued at the client side, a metadata dictionary translates the query into a form appropriate for the individual heterogeneous sites involved.
These queries are then mapped and sent to the local query processors.
The results from the heterogeneous sites are integrated into a global answer set.
Disadvantages
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.
Advantages
● The data is copied, processed, integrated, annotated, summarized, and restructured in the semantic data store in advance.
● Query processing does not require an interface to process data at local sources.
The following are the functions of data warehouse tools and utilities −
Note − Data cleaning and data transformation are important steps in improving the quality
of data and data mining results.