Data Warehouse
Unit -I
INTRODUCTION TO DATA WAREHOUSE
Syllabus:
● It includes historical data derived from transaction data from single and multiple sources.
Attributes of DW:
○ It is a database designed for investigative tasks, using data from various applications.
1) Business User: Business users require a data warehouse to view summarized data
from the past. Since these people are non-technical, the data may be presented to them in
an elementary form.
2) Store historical data: Data Warehouse is required to store the time variable data from
the past. This input is made to be used for various purposes.
3) Make strategic decisions: Some strategies may depend upon the data in the
data warehouse. So, the data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing data from different sources to a
common place, the user can effectively ensure uniformity and consistency in the data.
5) High response time: Data warehouse has to be ready for somewhat unexpected loads
and types of queries, which demands a significant degree of flexibility and quick
response time.
Components or Building Blocks of Data Warehouse:
Source data coming into the data warehouses may be grouped into four broad categories:
Production Data: This type of data comes from the various operational systems of the enterprise.
Based on the data requirements in the data warehouse, we choose segments of the data from the
various operational systems.
Internal Data: In each organization, the client keeps their "private" spreadsheets, reports,
customer profiles, and sometimes even department databases. This is the internal data, part of
which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every
operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large
percentage of the information they use. They use statistics associated with their industry
produced by external agencies.
● After we have extracted data from various operational systems and external sources, we
have to prepare the files for storing in the data warehouse.
● The extracted data coming from several different sources need to be changed, converted,
and made ready in a format that is relevant to be saved for querying and analysis.
1) Data Extraction: This method has to deal with numerous data sources. We have to employ
the appropriate techniques for each data source.
2) Data Transformation:
● As we know, data for a data warehouse comes from many different sources. If data
extraction for a data warehouse poses big challenges, data transformation presents even
more significant challenges.
● First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings, or the elimination of duplicates when we bring in the same data from various
source systems.
● Standardization of data components forms a large part of data transformation.
● On the other hand, data transformation also includes purging source data that is not useful
and separating out source records into new combinations.
3) Data Loading:
Data loading writes the prepared data into the data warehouse storage.
Information Delivery Component:
The information delivery element is used to enable the process of subscribing for data
warehouse files and having them transferred to one or more destinations according to some
customer-specified scheduling algorithm.
Metadata Component:
● Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system.
● In the data dictionary, we keep the data about the logical data structures, the data about
the records and addresses, the information about the indexes, and so on.
Data Marts:
● The current trends in data warehousing are to develop a data warehouse with several
smaller related data marts for particular kinds of queries and reports.
Management and Control Component:
● The management and control elements coordinate the services and functions within the
data warehouse.
● These components control the data transformation and the data transfer into the data
warehouse storage.
● It monitors the movement of information into the staging area and from there into the
data warehouse storage itself.
Database vs. Data Warehouse:
2. Database: The tables and joins are complicated since they are normalized for the RDBMS; this is done to reduce redundant data and to save storage space. Data Warehouse: The tables and joins are simple since they are de-normalized; this is done to minimize the response time for analytical queries.
3. Database: Data is dynamic. Data Warehouse: Data is largely static.
4. Database: Entity-Relationship modeling procedures are used for the RDBMS database design. Data Warehouse: Data modeling approaches are used for the Data Warehouse design.
7. Database: The database is the place where the data is taken as a base and managed to provide fast and efficient access. Data Warehouse: The Data Warehouse is the place where the application data is handled for analysis and reporting objectives.
10. Database: Relational databases are created for On-Line Transaction Processing (OLTP). Data Warehouse: The Data Warehouse is designed for On-Line Analytical Processing (OLAP).
Data Warehouse Architecture:
● Data Warehouse applications are designed to support the user ad-hoc data
requirements, an activity recently dubbed online analytical processing (OLAP).
➔ Flat Files
In a flat file system, transactional data is stored, and every file in the system must have a different name.
➔ Meta Data
● A set of data that defines and gives information about other data and Metadata is used
to direct a query to the most appropriate data source.
● For example, author, date built, date changed, and file size are examples of very
basic document metadata.
● We must clean and process the operational data before putting it into the warehouse.
● A data warehouse uses a staging area (a place where data is processed before entering
the warehouse).
● The Data Warehouse Staging Area is a temporary location where records from source
systems are copied.
● A staging area simplifies data cleansing and consolidation for operational data
coming from multiple source systems.
● A data mart is a segment of a data warehouse that can provide information for
reporting and analysis on a section, unit, department or operation in the company, e.g.,
sales, payroll, production, etc.
Properties of Data Warehouse Architectures:
2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume
to be managed and processed, and the number of users' requirements, progressively increase.
3. Extensibility: Able to perform new operations and technologies without redesigning the
whole system.
4. Security: Monitoring accesses are necessary because of the strategic data stored in the data
warehouses.
➔ Middle-tier
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
functions on multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly
implements multidimensional data and operations.
➔ Top-tier
● A top-tier that contains front-end tools for displaying results provided by OLAP,
as well as additional tools for data mining of the OLAP-generated data.
Metadata repository (Contains Below Info)
2. Operational metadata, which usually describes the currency level of the stored data, i.e.,
usage statistics, error reports, audit trails, etc.
3. System performance data, which includes indices, used to improve data access and
retrieval performance.
4. Information about the mapping from operational databases, which provides source
RDBMSs and their contents, cleaning and transformation rules, etc.
Key requirements of a data warehouse include:
● Load Performance
● Load Processing
● Data Quality Management
● Query Performance
● Terabyte Scalability
● The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer, and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and data
warehouse.
● The main advantage of the reconciled layer is that it creates a standard reference data
model for a whole enterprise. At the same time, it separates the problems of source data
extraction and integration from those of data warehouse population
Autonomous Data Warehouse :
● Autonomous Data Warehouse is a fully managed database tuned and optimized for
data warehouse workloads with the market-leading performance of Oracle
Database.
● EASY:
1. Fully autonomous database
2. Automated provisioning, patching and upgrades
3. Automated backups
4. Automated performance tuning
● FAST:
1. Built on Exadata: high performance, scalability and reliability
2. Built on key Oracle Database capabilities: parallelism, columnar
processing, compression
● ELASTIC:
1. Elastic scaling of compute and storage, without downtime
2. Pay only for resources consumed
Snowflake Data Warehouse:
Unit -II
Syllabus:
What is ETL – ETL Vs ELT – Types of Data warehouses - Data warehouse Design
and Modeling - Delivery Process - Online Analytical Processing (OLAP) -
Characteristics of OLAP - Online Transaction Processing (OLTP) Vs OLAP -
OLAP operations- Types of OLAP- ROLAP Vs MOLAP Vs HOLAP
What is ETL ?
● The mechanism of extracting information from source systems and bringing it into the
data warehouse is commonly called ETL, which stands for Extraction, Transformation
and Loading.
● The ETL process requires active inputs from various stakeholders, including
developers, analysts, testers.
● ETL is a recurring activity (daily, weekly, monthly) of a Data Warehouse system and
needs to be agile, automated, and well documented.
ETL Working Mechanism:
Extraction:
● This is the first stage of the ETL process. Extraction is the operation of
extracting information from a source system for further use in a data
warehouse environment.
● Extraction process is often one of the most time-consuming tasks in the ETL.
● The data has to be extracted several times in a periodic manner to supply all
changed data to the warehouse and keep it up-to-date.
Cleansing:
● The cleansing stage is crucial in a data warehouse technique because it is supposed to
improve data quality
● The primary data cleansing features found in ETL tools are rectification and
homogenization.
● They use specific dictionaries to rectify typing mistakes and to recognize synonyms, as
well as rule-based cleansing to enforce domain-specific rules and define appropriate
associations between values.
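As a rough sketch of how dictionary-based rectification and homogenization might look in practice (the dictionaries, field names, and the age rule below are illustrative assumptions, not features of any particular ETL tool):

    # Illustrative cleansing step: rectify typing mistakes, homogenize synonyms
    # using small lookup dictionaries, and apply one rule-based check.
    SPELLING_FIXES = {"Mumbay": "Mumbai", "Calcuta": "Kolkata"}   # rectification
    SYNONYMS = {"Bombay": "Mumbai", "Madras": "Chennai"}          # homogenization

    def clean_record(record):
        """Return a cleansed copy of a customer record (a dict)."""
        rec = dict(record)
        city = rec.get("city", "").strip().title()
        city = SPELLING_FIXES.get(city, city)
        city = SYNONYMS.get(city, city)
        rec["city"] = city
        # Rule-based cleansing: enforce a domain-specific rule on the age field.
        if not (0 < rec.get("age", 0) < 120):
            rec["age"] = None          # flag implausible values for review
        return rec

    print(clean_record({"city": " bombay ", "age": 250}))
    # -> {'city': 'Mumbai', 'age': None}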
Transformation:
● Transformation is the core of the reconciliation phase. It converts records from its
operational source format into a particular data warehouse format
○ Loose texts may hide valuable information. For example, the text "XYZ PVT Ltd" does not
explicitly show that this is a private limited company.
○ Different formats can be used for individual data. For example, a date can be saved as a
string or as three integers.
Following are the main transformation processes aimed at populating the reconciled data layer:
○ Conversion and normalization that operate on both storage formats and units of
measure to make data uniform.
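As a small illustration of conversion and normalization (the source date formats, field names, and unit factors are assumptions made only for this sketch):

    # Illustrative conversion/normalization: unify date formats and units of
    # measure before the data is written to the reconciled data layer.
    from datetime import datetime

    DATE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d-%b-%Y"]   # formats seen in the sources

    def normalize_date(value):
        """Convert a date string in any known source format to ISO 8601."""
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value, fmt).date().isoformat()
            except ValueError:
                continue
        raise ValueError("Unrecognized date format: %r" % value)

    def normalize_weight(value, unit):
        """Store every weight in kilograms, whatever unit the source used."""
        factors = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}
        return value * factors[unit.lower()]

    print(normalize_date("25/12/2023"))    # -> 2023-12-25
    print(normalize_weight(10, "lb"))      # -> 4.5359237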
Loading:
● The Load is the process of writing the data into the target database.
● During the load step, it is necessary to ensure that the load is performed correctly and
with as few resources as possible.
1. Refresh: Data Warehouse data is completely rewritten. This means that the older data is
replaced. Refresh is usually used in combination with static extraction to populate a data
warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying preexisting
data. This method is used in combination with incremental extraction to update data
warehouses regularly.
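A minimal sketch of the two loading modes, using SQLite from Python (the table layout and key column are assumptions; the ON CONFLICT clause needs a reasonably recent SQLite):

    # Illustrative loading step: "refresh" rewrites the target table completely,
    # while "update" applies only the changed rows from incremental extraction.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

    def refresh(rows):
        """Full refresh: the older data is replaced entirely."""
        conn.execute("DELETE FROM dw_sales")
        conn.executemany("INSERT INTO dw_sales VALUES (?, ?)", rows)
        conn.commit()

    def update(changed_rows):
        """Incremental update: insert or overwrite only the changed rows."""
        conn.executemany(
            "INSERT INTO dw_sales VALUES (?, ?) "
            "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
            changed_rows)
        conn.commit()

    refresh([(1, 100.0), (2, 250.0)])   # initial population
    update([(2, 275.0), (3, 80.0)])     # later, only the deltas are applied
    print(conn.execute("SELECT * FROM dw_sales ORDER BY sale_id").fetchall())
    # -> [(1, 100.0), (2, 275.0), (3, 80.0)]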
ETL Vs ELT
ETL:
● Process: Data is transferred to the ETL server and moved back to the DB. High network bandwidth is required.
ELT:
● Process: Data remains in the DB, except for cross-database loads (e.g., source to object).
● Transformations: Compute-intensive transformations are performed in the target database.
There are two types of host-based data warehouses which can be implemented:
Data Extraction and transformation tools allow the automated extraction and cleaning of data
from production systems.
To make such data warehouses building successful, the following phases are generally followed:
3. Load Phase: For moving the record directly into DB2 tables or a particular file for
moving it into another database or non-MVS warehouse.
● In this warehouse, we can extract information from a variety of sources and support
multiple LAN based warehouses
● The LAN based warehouse can support business users with complete data to information
solutions.
● This type of warehouse can include business views, histories, aggregation, versioning,
and heterogeneous source support, such as
a) DB2 Family
In this type of data warehouses, the data is not changed from the sources
This schema does generate several problems for the customer such as
● Providing clients the ability to query different DBMSs as if they were all a single
DBMS with a single API.
● Impacting performance since the customer will be competing with the production
data stores
The concept of a distributed data warehouse suggests that there are two types of distributed data
warehouses: local enterprise warehouses, which are distributed throughout the enterprise, and a
global warehouse.
Characteristics of Local(Distributed) data warehouses:
1. Installing a set of data access, data dictionary, and process management facilities.
2. Training end-clients.
4. Based upon actual usage, the physical Data Warehouse is created to provide the
high-frequency results.
Disadvantages:
1. Since queries compete with production record transactions, performance can be degraded.
Metadata Definition:
Example: The index of a book serves as a metadata for the contents in the book.
In other words, metadata is data about data.
Types of metadata:
● Operational metadata
● Extraction and transformation metadata
● End-user metadata
→ Operational Metadata:
● As you know, data for the data warehouse comes from several operational
systems of the enterprise.
● These source systems contain different data structures.
● The data elements selected for the data warehouse have various field lengths
and data types.
→ Extraction and transformation metadata:
● Extraction and transformation metadata contain data about the extraction of data from
the source systems, namely, the extraction frequencies, extraction methods, and business
rules for the data extraction.
● Also, this category of metadata contains information about all the data
transformations that take place in the data staging area.
→ End-user metadata:
● The end-user metadata is the navigational map of the data warehouse.
● It enables the end-users to find information from the data warehouse.
● The end-user metadata allows the end-users to use their own business
terminology and look for information in those ways in which they normally
think of the business.
Categories of Metadata:
a) Business Metadata:
It has the data ownership information, business definition, and changing policies.
Examples:
● Data ownership
● Query and reporting tools
● Predefined queries
● Predefined reports
● Report distribution information
● Common information access routes
● Rules for analysis using OLAP
● Currency of OLAP data
● Data warehouse refresh schedule
b) Technical Metadata:
1. It includes database system names, table and column names and sizes, data
types and allowed values.
2. Technical metadata also includes structural information such as primary and
foreign key attributes and indices.
Examples:
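For instance, a technical metadata entry for a single warehouse table could be recorded roughly as below (the structure and field names are only an illustrative assumption, not a standard format):

    # Illustrative technical-metadata entry for one warehouse table.
    sales_fact_metadata = {
        "database": "DWPROD",
        "table": "SALES_FACT",
        "columns": {
            "product_key": {"type": "INTEGER", "key": "foreign", "references": "PRODUCT"},
            "time_key":    {"type": "INTEGER", "key": "foreign", "references": "TIME"},
            "units_sold":  {"type": "INTEGER", "allowed_values": ">= 0"},
            "rupee_sold":  {"type": "DECIMAL(12,2)"},
        },
        "indices": ["idx_sales_time", "idx_sales_product"],
    }
    print(sales_fact_metadata["columns"]["units_sold"]["type"])   # -> INTEGER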
c) Operational Metadata:
It includes the currency of the data (whether the data is active, archived, or purged) and data
lineage (the history of the migrated data and the transformations applied to it).
Challenges of Metadata Management:
● Metadata in a big organization is scattered across the organization. This metadata is spread in
spreadsheets, databases, and applications.
● Metadata could be present in text files or multimedia files. To use this data for information
management solutions, it has to be correctly defined.
● There are no industry-wide accepted standards. Data management solution vendors have narrow
focus.
DATA MART:
● A Data Mart is a subset of an organizational information store, generally oriented to a specific purpose
or primary data subject, which may be distributed to support business needs.
● The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to
gather, store, access, and analyze records.
There are mainly two approaches to designing data marts. These approaches are:
Dependent Data Mart:
In this technique, the data warehouse creates the data mart. Therefore, there is no need for
data mart integration. It is also known as a top-down approach.
Independent Data Mart:
It is also termed as a bottom-up approach as the data marts are integrated to develop a data warehouse.
Cost-effective Data Marting
➢ As the merchant is not interested in the products they are not dealing with, the data mart is built as a
subset of the data covering only the product group of interest. The following diagram shows data
marting for different users.
Designing
The design step is the first in the data mart process. This phase covers all of the functions from
initiating the request for a data mart through gathering data about the requirements and
developing the logical and physical design of the data mart.
Constructing
This step contains creating the physical database and logical structures associated with the data
mart to provide fast and efficient access to the data.
1. Creating the physical database and logical structures such as tablespaces associated with
the data mart.
2. Creating the schema objects such as tables and indexes described in the design step.
Populating
This step includes all of the tasks related to the getting data from the source, cleaning it up,
modifying it to the right format and level of detail, and moving it into the data mart.
2. Extracting data
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating reports, charts
and graphs and publishing them.
It involves the following tasks:
1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer
translates database operations and object names into business terms so that the
end-clients can interact with the data mart using words which relate to the business
functions.
2. Set up and manage database structures, like summarized tables, which help queries
submitted through the front-end tools execute rapidly and efficiently.
Managing
This step contains managing the data mart over its lifetime. In this step, management functions
are performed as:
Network Access
A data mart could be on a different location from the data warehouse, so we should ensure that the LAN
or WAN has the capacity to handle the data volumes being transferred within the data mart load process.
★ Network capacity.
★ Time window available
★ Volume of data being transferred
★ Mechanisms being used to insert data into a data mart
Partitioning Strategy :
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the data.
Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to
load and also enhances the performance of the system.
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query performance
is enhanced because now the query scans only those partitions that are relevant. It does not have to scan
the whole data.
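A rough sketch of this idea, assuming the fact table is horizontally partitioned by month so that a query scans only the relevant partitions (the table and column names are made up for illustration):

    # Illustrative horizontal partitioning by month: each month's facts live in
    # their own table, and a query touches only the partitions it needs.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    for month in ("2023_01", "2023_02", "2023_03"):
        conn.execute(f"CREATE TABLE sales_{month} (item TEXT, amount REAL)")
    conn.execute("INSERT INTO sales_2023_01 VALUES ('phone', 120.0)")
    conn.execute("INSERT INTO sales_2023_02 VALUES ('phone', 300.0)")
    conn.execute("INSERT INTO sales_2023_03 VALUES ('phone', 250.0)")

    def total_sales(months):
        """Scan only the partitions for the requested months."""
        total = 0.0
        for month in months:
            row = conn.execute(
                f"SELECT COALESCE(SUM(amount), 0) FROM sales_{month}").fetchone()
            total += row[0]
        return total

    print(total_sales(["2023_02", "2023_03"]))  # -> 550.0 (the January partition is never scanned)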
Horizontal Partitioning:
➢ There are various ways in which a fact table can be partitioned. In horizontal partitioning,
we have to keep in mind the requirements for manageability of the data warehouse.
Partitioning Dimensions:
➢ If a dimension contains a large number of entries, it may be necessary to partition the
dimension itself.
Partition on a Round Robin Basis:
➢ In the round robin technique, when a new partition is needed, the old
one is archived.
➢ It uses metadata to allow user access tool to refer to the correct table
partition.
➢ This technique makes it easy to automate table management facilities
within the data warehouse.
Vertical Partition:
Vertical partitioning splits the data vertically. The following image depicts how vertical
partitioning is done.
● Normalization
● Row Splitting
Normalization
Normalization is the standard relational method of database organization. It removes redundancy
by splitting a table so that each fact is stored only once.
Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive
of row splitting is to speed up access to a large table by reducing its size.
Data transformation and calculation are performed based on the business rules that drive the
transformation.
There are several selection criteria which should be considered while implementing a data
warehouse:
1. The ability to identify the data in the data source environment that can be read by the tool
is necessary.
2. Support for flat files, indexed files, and legacy DBMSs is critical.
3. The capability to merge records from multiple data stores is required in many installations.
4. The specification interface to indicate the information to be extracted and converted is
essential.
5. The ability to read information from repository products or data dictionaries is desired.
6. Selective data extraction of both data items and records enables users to extract only the
required data.
7. A field-level data examination for the transformation of data into information is needed.
8. The ability to perform data type and character-set translation is a requirement when
moving data between incompatible systems.
10. The ability to create aggregation, summarization and derivation fields and records are
necessary.
11. Vendor stability and support for the products are components that must be evaluated
carefully.
Data Warehouse Software Components
A warehousing team will require different types of tools during a warehouse project. These
software products usually fall into one or more of the categories illustrated, as shown in the figure.
Figure: Data Warehouse Software Components - source data, extraction and transformation tools, warehouse storage, report writers, EIS/DSS, data mining, alert system, and exception reporting.
The warehouse team needs tools that can extract, transform, integrate, clean, and load
information from a source system into one or more data warehouse databases. Middleware and
gateway products may be needed for warehouses that extract a record from a host-based source
system.
Warehouse Storage
Software products are also needed to store warehouse data and their accompanying metadata.
Relational database management systems are well suited to large and growing warehouses.
Different types of software are needed to access, retrieve, distribute, and present warehouse data
to its end-clients.
Types of Database Parallelism
Parallelism is used to support speedup, where queries are executed faster because more
resources, such as processors and disks, are provided. Parallelism is also used to provide scale-up,
where increasing workloads are managed without increased response time, via an increase in the
degree of parallelism.
Different architectures for parallel database systems are shared-memory, shared-disk, shared-
nothing, and hierarchical structures.
(a)Horizontal Parallelism: It means that the database is partitioned across multiple disks, and
parallel processing occurs within a specific task (i.e., table scan) that is performed concurrently on
different processors against different sets of data.
(b)Vertical Parallelism: It occurs among various tasks. All component query operations (i.e., scan,
join, and sort) are executed in parallel in a pipelined fashion. In other words, the output of one
function (e.g., scan) is consumed by the next function (e.g., join) as soon as records become available.
Figure: Response time of a serial RDBMS versus parallel execution across different cases.
Intraquery Parallelism
Intraquery parallelism defines the execution of a single query in parallel on multiple processors
and disks. Using intraquery parallelism is essential for speeding up long-running queries.
Interquery parallelism does not help in this function since each query is run sequentially.
To improve the situation, many DBMS vendors developed versions of their products that utilized
intraquery parallelism.
This application of parallelism decomposes the serial SQL query into lower-level operations such
as scan, join, sort, and aggregation.
Interquery Parallelism
In interquery parallelism, different queries or transactions execute in parallel with one another.
This form of parallelism can increase transaction throughput. The response times of individual
transactions are not faster than they would be if the transactions were run in isolation.
Thus, the primary use of interquery parallelism is to scale up a transaction processing system to
support a more significant number of transactions per second.
This approach naturally resulted in interquery parallelism, in which different server threads (or
processes) handle multiple requests at the same time.
Interquery parallelism has been successfully implemented on SMP systems, where it increased the
throughput and allowed the support of more concurrent users.
Each RDBMS server can read, write, update, and delete information from the same shared
database, which would need the system to implement a form of a distributed lock manager (DLM).
DLM components can be found in hardware, the operating system, and separate software layer, all
depending on the system vendor.
On the positive side, shared-disk architectures can reduce performance bottlenecks resulting from
data skew (uneven distribution of data), and can significantly increase system availability.
The shared-disk distributed memory design eliminates the memory access bottleneck typical of
large SMP systems and helps reduce DBMS dependency on data partitioning.
Shared-Memory Architecture
It is relatively simple to implement and has been very successful up to the point where it runs into
the scalability limitations of the shared-everything architecture.
The key point of this technique is that a single RDBMS server can probably apply all processors,
access all memory, and access the entire database, thus providing the client with a consistent
single system image.
Figure: Shared-Memory Architecture - processors connected to shared memory through an interconnection network.
In shared-memory SMP systems, the DBMS considers that the multiple database components
executing SQL statements communicate with each other by exchanging messages and information
via the shared memory.
All processors have access to all data, which is partitioned across local disks.
Shared-Nothing Architecture
In a shared-nothing distributed memory environment, the data is partitioned across all disks, and
the DBMS is "partitioned" across multiple co-servers, each of which resides on individual nodes of
the parallel system and has an ownership of its disk and thus its database partition.
A shared-nothing RDBMS parallelizes the execution of a SQL query across multiple processing
nodes.
Each processor has its memory and disk and communicates with other processors by exchanging
messages and data over the interconnection network.
This architecture is optimized specifically for the MPP and cluster systems.
The shared-nothing architectures offer near-linear scalability. The number of processor nodes is
limited only by the hardware platform limitations (and budgetary constraints), and each node itself
can be a powerful SMP system.
Figure: Shared-Nothing Architecture - each node has its own processor, memory, and disk, connected by an interconnection network.
Data Warehouse Process Architecture
The process architecture defines an architecture in which the data from the data warehouse is
processed for a particular computation.
Centralized Process Architecture
In this architecture, the data is collected into single centralized storage and processed upon
completion by a single machine with a huge structure in terms of memory, processor, and storage.
Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service.
It is very successful when the collection and consumption of data occur at the same location.
Distributed Process Architecture
In this architecture, information and its processing are distributed across data centers; processing
is localized near the data, and the results are gathered into centralized storage. Distributed
architectures are used to overcome the limitations of the centralized process architecture, where
all the information needs to be collected at one central location and results are available in one
central location.
Client-Server
In this architecture, the user does all the information collecting and presentation, while the server
does the processing and management of data.
Three-tier Architecture
With client-server architecture, the client machines need to be connected to a server machine,
thus mandating finite states and introducing latencies and overhead in terms of record to be
carried between clients and servers.
N-tier Architecture
The n-tier or multi-tier architecture is where clients, middleware, applications, and servers are
isolated into tiers.
Cluster Architecture
In this architecture, machines that are connected in a network architecture (software or hardware)
work together to process information or compute requirements in parallel. Each
device in a cluster is associated with a function that is processed locally, and the result sets are
collected to a master server that returns them to the user.
Peer-to-Peer Architecture
This is a type of architecture where there are no dedicated servers and clients. Instead, all the
processing responsibilities are allocated among all machines, called peers. Each machine can
perform the function of a client or server or just process data.
What is Fact Constellation Schema?
A Fact constellation means two or more fact tables sharing one or more dimensions. It is also
called Galaxy schema.
Fact Constellation Schema describes a logical structure of data warehouse or data mart. Fact
Constellation Schema can design with a collection of de-normalized FACT, Shared, and Conformed
Dimension tables.
Figure: Fact constellation schema - sales and shipping fact tables sharing dimension tables such as time, item, branch, location, and shipper.
This schema defines two fact tables, sales, and shipping. Sales are treated along four dimensions,
namely, time, item, branch, and location. The schema contains a fact table for sales that includes
keys to each of the four dimensions, along with two measures: Rupee_sold and units_sold. The
shipping table has five dimensions, or keys: item_key, time_key, shipper_key, from_location, and
to_location, and two measures: Rupee_cost and units_shipped.
The primary disadvantage of the fact constellation schema is that it is a more challenging design
because many variants for specific kinds of aggregation must be considered and selected.
Figure: Data warehouse applications - information processing, analytical processing, and data mining.
Information Processing
It deals with querying, statistical analysis, and reporting via tables, charts, or graphs. Nowadays,
information processing of data warehouse is to construct a low cost, web-based accessing tools
typically integrated with web browsers.
Analytical Processing
It supports various online analytical processing such as drill-down, roll-up, and pivoting. The
historical data is being processed in both summarized and detailed format.
OLAP is implemented on data warehouses or data marts. The primary objective of OLAP is to
support the ad-hoc querying needed to support DSS. The multidimensional view of data is
fundamental to the OLAP application. OLAP is an operational view, not a data structure or schema.
The complex nature of OLAP applications requires a multidimensional view of the data.
Data Mining
It helps in the discovery of hidden patterns and associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using visualization tools.
Data mining is the technique of discovering meaningful new correlations, patterns, and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
technologies as well as statistical and mathematical techniques.
Star Schema
o In a star schema, the fact table will be at the center and is connected to the dimension
tables.
Figure: Star schema - a central fact table connected to the surrounding dimension tables.
Snowflake Schema
o A snowflake schema is an extension of star schema where the dimension tables are
connected to one or more dimensions.
o Data redundancy is low and occupies less disk space when compared to star schema.
Figure: Snowflake schema - a central fact/measure table with normalized dimension tables.
The snowflake schema is an expansion of the star schema where each point of the star explodes
into more points. It is called snowflake schema because the diagram of snowflake schema
resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a STAR
schema. When we normalize all the dimension tables entirely, the resultant structure resembles a
snowflake with the fact table in the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagramed with
each fact surrounded by its associated dimensions, and those dimensions are related to other
dimensions, branching out into a snowflake pattern.
The snowflake schema consists of one fact table which is linked to many dimension tables, which
can be linked to other dimension tables through a many-to-one relationship. Tables in a snowflake
schema are generally normalized to the third normal form. Each dimension table represents exactly
one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three levels.
A snowflake schema can have any number of dimensions, and each dimension can have any
number of levels.
Figure: Snowflake schema with two dimensions, each having three levels.
Example: Figure shows a snowflake schema with a Sales fact table, with Store, Location, Time,
Product, Line, and Family dimension tables. The Market dimension has two dimension tables with
Store as the primary dimension table, and Location as the outrigger dimension table. The product
dimension has three dimension tables with Product as the primary dimension table, and the Line
and Family table are the outrigger dimension tables.
Figure: Snowflake schema for the Sales example, with the following tables and columns:
● Sales fact: Store ID, Product ID, Time ID, Sale, Cost of goods sold, Advertising
● Store: Store ID, Store name, Store size, Store address, Postal code ID
● Location: Postal code ID, Postal code, Region name, Region director, State name, State director, State population, City name, City population
● Product: Product ID, Product name, Product description, Product ounces, Product caffeinated, Line ID
● Line: Line ID, Line name, Line description, Family ID
● Family: Family ID, Family name, Family description
● Time: Time ID, Year, Quarter number, Quarter name, Month number, Month name, Week of year, Day name, Weekday, Holiday, Day of year, Day of month, Fiscal year, Fiscal quarter number, Fiscal month number
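A minimal sketch of how two of these snowflaked chains (Store to Location, and Product to Line to Family) could be declared, using SQLite from Python with trimmed column lists; the exact DDL is an illustrative assumption:

    # Illustrative snowflaked dimension chains: each dimension table holds
    # exactly one level of the hierarchy and references the next level up.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE family   (family_id INTEGER PRIMARY KEY, family_name TEXT);
    CREATE TABLE line     (line_id INTEGER PRIMARY KEY, line_name TEXT,
                           family_id INTEGER REFERENCES family(family_id));
    CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_name TEXT,
                           line_id INTEGER REFERENCES line(line_id));
    CREATE TABLE location (postal_code_id INTEGER PRIMARY KEY, city_name TEXT);
    CREATE TABLE store    (store_id INTEGER PRIMARY KEY, store_name TEXT,
                           postal_code_id INTEGER REFERENCES location(postal_code_id));
    CREATE TABLE sales_fact (store_id INTEGER REFERENCES store(store_id),
                             product_id INTEGER REFERENCES product(product_id),
                             sale REAL);
    """)

    # A family-level sales query has to join through the snowflaked chain:
    query = """
    SELECT f.family_name, SUM(s.sale)
    FROM sales_fact s
    JOIN product p ON p.product_id = s.product_id
    JOIN line    l ON l.line_id    = p.line_id
    JOIN family  f ON f.family_id  = l.family_id
    GROUP BY f.family_name
    """
    print(conn.execute(query).fetchall())   # -> [] until rows are loaded

The extra joins through line and family are the price paid for the normalized dimension tables, which is the query-performance impact mentioned above.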
A star schema stores all attributes for a dimension in one denormalized table. This needs more
disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension by
moving attributes with low cardinality into separate dimension tables that relate to the core
dimension table by using foreign keys. Snowflaking for the sole purpose of minimizing disk space
is not recommended, because it can adversely impact query performance.
In a snowflake schema, tables are normalized to remove redundancy. In a snowflake schema, dimension
tables are broken up into multiple dimension tables.
The figure shows a simple STAR schema for sales in a manufacturing company. The sales fact table
includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME
are the dimension tables.
Figure: STAR Schema for sales, with the following tables and columns:
● SALES FACTS: Product Key, Time Key, Customer Key, Salesrep Key, Sales price, Margin
● PRODUCT: Product Key, Product name, Product code, Brand name
● CUSTOMER: Customer Key, Customer name
● TIME: Time Key, Date, Month, Quarter, Year
● SALESREP: Salesrep Key, Salesperson name, Territory name, Region name
The STAR schema for sales, as shown above, contains only five tables, whereas the normalized
version now extends to eleven tables. We will notice that in the snowflake schema, the attributes
with low cardinality in each original dimension tables are removed to form separate tables. These
new tables are connected back to the original dimension table through artificial keys.
Figure: Snowflake Schema - the normalized version of the sales STAR schema, with the following tables and columns:
● SALES: Product Key, Time Key, Customer Key, Salesrep Key, Sales quantity, Sales dollars, Sales price, Margin
● PRODUCT: Product Key, Product name, Product code, Brand Key, Package Key
● BRAND: Brand Key, Brand name, Category Key
● CATEGORY: Category Key, Product category
● PACKAGE: Package Key, Package type
● CUSTOMER: Customer Key, Customer name, Customer code, Address, State, Zip, Country Key
● COUNTRY: Country Key, Country name
● SALESREP: Salesrep Key, Salesperson name, Territory Key
● TERRITORY: Territory Key, Territory name, Region Key
● REGION: Region Key, Region name
● TIME: Time Key, Date, Month, Quarter, Year
A snowflake schema is designed for flexible querying across more complex dimensions and
relationship. It is suitable for many to many and one to many relationships between dimension
levels.
1. The primary advantage of the snowflake schema is the development in query performance
due to minimized disk storage requirements and joining smaller lookup tables.
1. The primary disadvantage of the snowflake schema is the additional maintenance efforts
required due to the increasing number of lookup tables. It is also known as a multi fact star
schema.
A star schema is a relational schema whose design represents a multidimensional data model.
The star schema is the simplest data warehouse schema. It is known as a star schema because the
entity-relationship diagram of this schema resembles a star, with points diverging from a central
table. The center of the schema consists of a large fact table, and the points of the star are the
dimension tables.
Figure: Star schema - a large central fact table with dimension tables at the points of the star.
Fact Tables
A fact table in a star schema contains facts and is connected to the dimensions. A fact table has two
types of columns: those that contain facts and those that are foreign keys to the dimension tables.
The primary key of the fact table is generally a composite key that is made up of all of its foreign
keys.
A fact table may contain either detail-level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often instead called summary tables). A fact table generally
contains facts with the same level of aggregation.
Dimension Tables
A dimension is an architecture usually composed of one or more hierarchies that categorize data.
If a dimension has not got hierarchies and levels, it is called a flat dimension or list. The primary
keys of each of the dimensions table are part of the composite primary keys of the fact table.
Dimensional attributes help to define the dimensional value. They are generally descriptive, textual
values. Dimension tables are usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic region
(markets, cities), clients, products, times, and channels.
The star schema is intensely suitable for data warehouse database design because of the following
features:
o It provides a flexible design that can be changed easily or added to throughout the
development cycle, and as the database grows.
o It provides a parallel in design to how end-users typically think of and use the data.
Star Schemas are easy for end-users and application to understand and navigate. With a well-
designed schema, the customer can instantly analyze large, multidimensional data sets.
Figure: Advantages of the star schema - query performance, load performance and administration, built-in referential integrity, and being easily understood.
Query Performance
Because a star schema database has a limited number of tables and clear join paths, queries run faster
than they do against OLTP systems. Small single-table queries, frequently of a dimension table, are
almost instantaneous. Large join queries that contain multiple tables take only seconds or
minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table.
When two dimension tables are used in a query, only one join path, intersecting the fact table,
exists between those two tables. This design feature enforces authentic and consistent query
results.
Load Performance and Administration
Structural simplicity also decreases the time required to load large batches of records into a star
schema database. By describing facts and dimensions and separating them into various tables,
the impact of a load operation is reduced. Dimension tables can be populated once and
occasionally refreshed. We can add new facts regularly and selectively by appending records to a
fact table.
Built-in referential integrity
A star schema has referential integrity built-in when information is loaded. Referential integrity is
enforced because each data in dimensional tables has a unique primary key, and all keys in the
fact table are legitimate foreign keys drawn from the dimension table. A record in the fact table
which is not related correctly to a dimension cannot be given the correct key value to be retrieved.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact
table. These joins are more significant to the end-user because they represent the fundamental
relationship between parts of the underlying business. Customers can also browse dimension table
attributes before constructing a query.
There are some conditions which cannot be met by star schemas. For example, the relationship between
a user and a bank account cannot be described as a star schema, since the relationship between them is
many to many.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables
connected to it for time, branch, item, and geographic locations.
The TIME table has a column for each day, month, quarter, and year. The ITEM table has columns
for each item_Key, item_name, brand, type, supplier_type. The BRANCH table has columns for each
branch_key, branch_name, branch_type. The LOCATION table has columns of geographic data,
including street, city, state, and country.
Figure: Star schema for the SALES fact table with the TIME, ITEM, BRANCH, and LOCATION dimension tables.
In this scenario, the SALES table contains only four columns with IDs from the dimension tables,
TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four columns for
ITEM data, three columns for BRANCH data, and four columns for LOCATION data. Thus, the size
of the fact table is significantly reduced. When we need to change an item, we need only make a
single change in the dimension table, instead of making many changes in the fact table.
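A small, hedged sketch of this point using SQLite from Python (simplified columns): the fact table stores only keys, so renaming an item is a single UPDATE on the dimension table, and every fact row picks up the change through the join:

    # Illustrative star schema: the SALES fact table holds only dimension keys,
    # so changing an item's description is one dimension-table update.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE item  (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE sales (time_key INTEGER, item_key INTEGER REFERENCES item(item_key),
                        branch_key INTEGER, location_key INTEGER, units_sold INTEGER);
    INSERT INTO item  VALUES (1, 'home entertainment', 'BrandA');
    INSERT INTO sales VALUES (101, 1, 5, 9, 605), (102, 1, 5, 9, 680);
    """)

    # One change in the dimension table is reflected by every related fact row.
    conn.execute("UPDATE item SET item_name = 'home theatre' WHERE item_key = 1")
    rows = conn.execute("""
        SELECT i.item_name, SUM(s.units_sold)
        FROM sales s JOIN item i ON i.item_key = s.item_key
        GROUP BY i.item_name
    """).fetchall()
    print(rows)   # -> [('home theatre', 1285)]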
We can create even more complex star schemas by normalizing a dimension table into several
tables. The normalized dimension table is called a Snowflake.
What is Data Cube?
When data is grouped or combined in multidimensional matrices called Data Cubes. The data
cube method has a few alternative names or a few variants, such as "Multidimensional databases,"
"materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are
frequently queried.
For example, a relation with the schema sales (part, supplier, customer, and sale-price) can be
materialized into a set of eight views as shown in fig, where psc indicates a view consisting of
aggregate function value (such as total-sales) computed by grouping three attributes part,
supplier, and customer, p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone, etc.
Figure: Lattice of the eight views obtained by grouping on subsets of (part, supplier, customer), from psc at the top down to none.
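A rough, plain-Python sketch of materializing those eight group-by views (the toy sales rows below are made up purely for illustration):

    # Illustrative materialization of all 2^3 group-by views over the dimensions
    # (part, supplier, customer), aggregating total sale-price for each grouping.
    from itertools import combinations
    from collections import defaultdict

    sales = [  # (part, supplier, customer, sale_price) -- made-up rows
        ("bolt", "S1", "C1", 10.0),
        ("bolt", "S2", "C1", 12.0),
        ("nut",  "S1", "C2",  4.0),
    ]
    DIMS = ("part", "supplier", "customer")

    views = {}
    for k in range(len(DIMS) + 1):
        for group in combinations(range(len(DIMS)), k):   # (), (0,), ..., (0, 1, 2)
            totals = defaultdict(float)
            for row in sales:
                key = tuple(row[i] for i in group)         # values of the grouped dims
                totals[key] += row[3]                      # sum the sale-price measure
            name = "".join(DIMS[i][0] for i in group) or "none"
            views[name] = dict(totals)

    print(sorted(views))   # -> ['c', 'none', 'p', 'pc', 'ps', 'psc', 's', 'sc']
    print(views["p"])      # -> {('bolt',): 22.0, ('nut',): 4.0}
    print(views["none"])   # -> {(): 26.0}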
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to
be measure attributes, i.e., the attributes whose values are of interest. Other attributes are
selected as dimensions or functional attributes. The measure attributes are aggregated according
to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, branch, and location. These dimensions enable the store to keep track of
things like monthly sales of items, and the branches and locations at which the items were sold.
Each dimension may have a table associated with it, known as a dimensional table, which describes
the dimension. For example, a dimension table for items may contain the attributes item_name,
brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse
in many cases because not every cell in each dimension may have corresponding data in the
database.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear
how to make the best use of the precomputed results stored in the data cube.
The multidimensional data model views data in the form of a data cube. OLAP tools are based on the multidimensional
data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional
data model is organized around a central theme, like sales and transactions. A fact table
represents this theme. Facts are numerical measures. Thus, the fact table contains measure (such
as Rs_sold) and keys to each of the related dimensional tables.
A data cube is defined by dimensions and facts. Facts are generally quantities, which are used for
analyzing the relationships between dimensions.
Figure: Data cube of sales with dimensions Date (1Qtr-4Qtr), Country, and product, including the aggregated sums.
Example: In the 2-D representation, we will look at the All Electronics sales data for items sold
per quarter in the city of Vancouver. The measured display in dollars sold (in thousands).
2-D view of Sales Data for location = "Vancouver" (dollars sold, in thousands):
time (quarter) | home entertainment | computer | phone | security
Q1 | 605 | 825 | 14 | 400
Q2 | 680 | 952 | 31 | 512
Q3 | 812 | 1023 | 30 | 501
Q4 | 927 | 1038 | 38 | 580
3-Dimensional Cuboids
Let us suppose we would like to view the sales data with a third dimension. For example, suppose we
would like to view the data according to time, item as well as the location for the cities Chicago,
New York, Toronto, and Vancouver. The measured display in dollars sold (in thousands). These 3-D
data are shown in the table. The 3-D data of the table are represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
Figure: 3-D data cube of the sales data for the cities Chicago, New York, Toronto, and Vancouver, organized by time (quarter) and item (type); dollars sold in thousands.
Let us suppose that we would like to view our sales data with an additional fourth dimension, such
as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level of
summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location,
and supplier dimensions.
Figure: 4-D data cube of the sales data shown as a series of 3-D cubes, one for each supplier (SUP1, SUP2, SUP3).
The figure shows a 4-D data cube representation of sales data, according to the dimensions time,
item, location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex
cuboid. In this example, this is the total sales, or dollars sold, summarized over all four dimensions.
The lattice of cuboid forms a data cube. The figure shows the lattice of cuboids creating 4-D data
cubes for the dimension time, item, location, and supplier. Each cuboid represents a different
degree of summarization.
Figure: Lattice of cuboids for the 4-D data cube (time, item, location, supplier), from the 0-D apex cuboid ("all") down to the 4-D base cuboid.
In dimensional modeling, the transaction record is divided into either “facts,” which are
frequently numerical transaction data, or "dimensions," which are the reference information that
gives context to the facts. For example, a sale transaction can be broken up into facts such as the
number of products ordered and the price paid for the products, and into dimensions such as
order date, user name, product number, order ship-to, and bill-to locations, and salesman
responsible for receiving the order.
Objectives of Dimensional Modeling:
1. To produce a database architecture that is easy for end-clients to understand and write
queries.
2. To maximize the efficiency of queries. It achieves these goals by minimizing the number of
tables and relationships between them.
Dimensional modeling promotes data quality: The star schema enables warehouse
administrators to enforce referential integrity checks on the data warehouse. Since the fact
table's key is a concatenation of the keys of its associated dimensions, a factual record
is only loaded if the corresponding dimension records are duly defined and also exist in the
database.
By enforcing foreign key constraints as a form of referential integrity check, data warehouse DBAs
add a line of defense against corrupted warehouses data.
Performance optimization is possible through aggregates: As the size of the data warehouse
increases, performance optimization develops into a pressing concern. Customers who have to
wait for hours to get a response to a query will quickly become discouraged with the warehouses.
Aggregates are one of the easiest methods by which query performance can be optimized.
Disadvantages of Dimensional Modeling:
1. To maintain the integrity of facts and dimensions, loading the data warehouse with records
from various operational systems is complicated.
2. It is difficult to modify the data warehouse operation if the organization adopting the
dimensional technique changes the method in which it does business.
Fact
It is a collection of associated data items, consisting of measures and context data. It typically
represents business items or business transactions.
Dimensions
It is a collection of data which describe one business dimension. Dimensions decide the contextual
background for the facts, and they are the framework over which OLAP is performed.
Measure
It is a numeric attribute of a fact, representing the performance or behavior of the business relative
to the dimensions.
Considering the relational context, there are two basic models which are used in dimensional
modeling:
o Star Model
o Snowflake Model
The star model is the underlying structure for a dimensional model. It has one broad central table
(fact table) and a set of smaller tables (dimensions) arranged in a radial design around the primary
table. The snowflake model is the conclusion of decomposing one or more of the dimensions.
Fact Table
Fact tables are used to store facts or measures of the business. Facts are the numeric data elements
that are of interest to the company.
The fact table includes numerical values of what we measure. For example, a fact value of 20 might
mean that 20 widgets have been sold.
Each fact table includes the keys to associated dimension tables. These are known as foreign keys
in the fact table.
When it is compared to dimension tables, fact tables have a large number of rows.
Dimension Table
Dimension tables establish the context of the facts. Dimensional tables store fields that describe
the facts.
Dimension tables contain the details about the facts. That, as an example, enables the business
analysts to understand the data and their reports better.
The dimension tables include descriptive data about the numerical values in the fact table. That is,
they contain the attributes of the facts. For example, the dimension tables for a marketing analysis
function might include attributes such as time, marketing region, and product type.
Since the record in a dimension table is denormalized, it usually has a large number of columns.
The dimension tables include significantly fewer rows of information than the fact table.
The attributes in a dimension table are used as row and column headings in a document or query
results display.
Example: A store summary in a fact table can be viewed by city and state. An item summary can be viewed
by brand, color, etc. Customer information can be viewed by name and address.
Example fact table (Time ID, Product ID, Customer ID, Quantity):
4 | 17 | 2 | 1
8 | 21 | 3 | 2
8 | 4 | 1 | 1
In this example, Customer ID column in the facts table is the foreign keys that join with the
dimension table. By following the links, we can see that row 2 of the fact table records the fact
that customer 3, Gaurav, bought two items on day 8.
Customer Dimension Table (Customer ID, Name, Gender, other attributes):
1 | Rohan | Male | 2 | 3 | 4
2 | Sandeep | Male | 3 | 5 | 1
3 | Gaurav | Male | 1 | 7 | 3
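A tiny sketch of how the foreign key in the fact table resolves against the customer dimension (the column meanings beyond Customer ID and Name are assumptions made for illustration):

    # Illustrative foreign-key lookup: fact rows carry only the customer ID;
    # the dimension table supplies the descriptive attributes.
    customers = {                      # customer dimension keyed by Customer ID
        1: {"name": "Rohan",   "gender": "Male"},
        2: {"name": "Sandeep", "gender": "Male"},
        3: {"name": "Gaurav",  "gender": "Male"},
    }
    fact_rows = [                      # assumed columns: (time_id, product_id, customer_id, quantity)
        (4, 17, 2, 1),
        (8, 21, 3, 2),
        (8,  4, 1, 1),
    ]

    for time_id, product_id, customer_id, qty in fact_rows:
        cust = customers[customer_id]
        print(f"{cust['name']} bought {qty} item(s) on day {time_id}")
    # The second row prints: Gaurav bought 2 item(s) on day 8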
Hierarchy
A hierarchy is a directed tree whose nodes are dimensional attributes and whose arcs model many-
to-one associations between dimensional attributes. It contains a dimension, positioned at
the tree's root, and all of the dimensional attributes that define it.
What is Multi-Dimensional Data Model?
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records.
For example, a shop may create a sales data warehouse to keep records of the store's sales for the
dimension time, item, and location. These dimensions allow the shop to keep track of things, for
example, monthly sales of items and the locations at which the items were sold. Each dimension
has a table related to it, called a dimensional table, which describes the dimension further. For
example, a dimensional table for an item may contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.
Tabular representation of sales (timeid, itemid, locid, sales):
11 | 1 | 1 | 25
11 | 2 | 1 | 8
11 | 3 | 1 | 15
12 | 1 | 1 | 30
12 | 2 | 1 | 20
12 | 3 | 1 | 50
13 | 1 | 1 | 8
13 | 2 | 1 | 10
13 | 3 | 1 | 10
Multidimensional representation (the slice for locid = 1 is shown; rows are timeid, columns are itemid):
timeid 11: 25 | 8 | 15
timeid 12: 30 | 20 | 50
timeid 13: 8 | 10 | 10
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the
table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in
quarters) and the item dimension (classified according to the types of items sold). The fact or
measure displayed is rupee_sold (in thousands).
Location="Delhi"
item (type)
Time (quarter) Egg | Milk | Bread | Biscuit
Q1 260 | 508 | 15 60
Q2 390 | 256 20 90
Q3 436 | 396 | 50 40
Q4 528 | 483 | 35 50
Now, if we want to view the sales data with a third dimension, For example, suppose the data
according to time and item, as well as the location is considered for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3D data are shown in the table. The 3D data of the table are
represented as a series of 2D tables.
Table: 3-D sales data - for each of the four cities (Chennai, Kolkata, Mumbai, Delhi), the columns Egg, Milk, Bread, and Biscuit are repeated against the Time (quarter) rows.
Conceptually, it may also be represented by the same data in the form of a 3D data cube, as
shown in fig:
Figure: 3-D data cube of the sales data for the cities Chennai, Kolkata, Mumbai, and Delhi.
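As a minimal sketch (reusing a few of the figures above and making up the rest), slicing the 3-D cube on a single city drops the location dimension and leaves exactly the kind of 2-D table shown earlier:

    # Illustrative "slice" of 3-D sales data (time, item, location) down to one
    # location, producing the 2-D view for that city.
    sales = {  # (quarter, item, city) -> rupee_sold
        ("Q1", "Egg", "Delhi"): 260, ("Q1", "Milk", "Delhi"): 508,
        ("Q2", "Egg", "Delhi"): 390, ("Q2", "Milk", "Delhi"): 256,
        ("Q1", "Egg", "Mumbai"): 390, ("Q1", "Milk", "Mumbai"): 385,
    }

    def slice_by_city(cube, city):
        """Keep only the cells for one city, dropping the location dimension."""
        return {(q, item): v for (q, item, c), v in cube.items() if c == city}

    print(slice_by_city(sales, "Delhi"))
    # -> {('Q1', 'Egg'): 260, ('Q1', 'Milk'): 508, ('Q2', 'Egg'): 390, ('Q2', 'Milk'): 256}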
Note − The above list can be used as evaluation parameters for the
evaluation of a good scheduler.
Some important jobs that a scheduler must be able to handle are as follows −
Note − The Event manager monitors the events occurrences and deals with
them. The event manager also tracks the myriad of things that can go
wrong on this complex data warehouse system.
Events :
Events are the actions that are generated by the user or the system itself.
It may be noted that the event is a measurable, observable, occurrence of a defined
action.
➔ Hardware failure
➔ Running out of space on certain key disks
➔ A process dying
➔ A process returning an error
➔ CPU usage exceeding an 80% threshold
➔ Internal contention on database serialization points
➔ Buffer cache hit ratios exceeding or falling below the threshold
➔ A table reaching the maximum of its size
➔ Excessive memory swapping
➔ A table failing to extend due to lack of space
➔ Disk exhibiting I/O bottlenecks
➔ Usage of temporary or sort area reaching a certain threshold
➔ Any other database shared memory usage
➔ The most important thing about events is that they should be capable of
executing on their own.
➔ Event packages define the procedures for the predefined events. The code
associated with each event is known as event handler.
➔ This code is executed whenever an event occurs.
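A minimal sketch of the idea: an event package maps predefined events to handler code that executes on its own when the event occurs (the event names and handler actions are illustrative assumptions):

    # Illustrative event manager: predefined events are mapped to handler
    # functions (the "event package"); raising an event runs its handler.
    HANDLERS = {}

    def on(event_name):
        """Register a handler function for a predefined event."""
        def register(func):
            HANDLERS[event_name] = func
            return func
        return register

    @on("disk_space_low")
    def handle_disk_space_low(details):
        print("Archiving old partitions; details:", details)

    @on("process_returned_error")
    def handle_process_error(details):
        print("Restarting the failed load process; details:", details)

    def raise_event(event_name, **details):
        handler = HANDLERS.get(event_name)
        if handler:
            handler(details)           # the event handler executes automatically
        else:
            print("Unhandled event:", event_name)

    raise_event("disk_space_low", disk="/dw/stage", free_mb=512)
    # -> Archiving old partitions; details: {'disk': '/dw/stage', 'free_mb': 512}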
The criteria for choosing a system and the database manager are as follows −
● The backup and recovery tool makes it easy for operations and management
staff to back-up the data.
● Note that the system backup manager must be integrated with the schedule
manager software being used.
The important features that are required for the management of backups are as
follows −
➔ Scheduling
➔ Backup data tracking
➔ Database awareness
Backups are taken only to protect against data loss. Following are the important
points to remember −
★ The backup software will keep some form of database of where and when the
piece of data was backed up.
★ The backup recovery manager must have a good front-end to that database.
★ The backup recovery software should be database aware.
★ Being aware of the database, the software then can be addressed in
database terms, and will not perform backups that would not be viable.
Data Warehousing - Process Managers
Process managers are responsible for maintaining the flow of data both into and out
of the data warehouse. There are three different types of process managers −
➢ Load manager
➢ Warehouse manager
➢ Query manager
● Load manager performs the operations required to extract and load the data
into the database.
● The size and complexity of a load manager varies between specific solutions
from one data warehouse to another.
Fast Load
➢ In order to minimize the total load window, the data needs to be loaded into
the warehouse in the fastest possible time.
➢ Transformations affect the speed of data processing.
➢ It is more effective to load the data into a relational database prior to
applying transformations and checks.
➢ Gateway technology is not suitable, since they are inefficient when large data
volumes are involved.
Simple Transformations
1. Strip out all the columns that are not required within the warehouse.
Warehouse Manager:
● The warehouse manager is responsible for the warehouse management
process.
● It consists of a third-party system software, C programs, and shell scripts.
● The size and complexity of a warehouse manager varies between specific
solutions.