
CCS341 - DATA WAREHOUSING

Unit -I
INTRODUCTION TO DATA WAREHOUSE

Syllabus:

Data warehouse Introduction - Data warehouse components - Operational database Vs data warehouse - Data warehouse Architecture - Three-tier Data Warehouse Architecture - Autonomous Data Warehouse - Autonomous Data Warehouse Vs Snowflake - Modern Data Warehouse

What is a Data Warehouse?


● A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing.

● It includes historical data derived from transaction data from single and multiple sources.

● "Data Warehouse is a subject-oriented, integrated, and time-variant store of information


in support of management's decisions."

Attributes of DW:

○ It is a database designed for investigative tasks, using data from various applications.

○ It supports a relatively small number of clients with relatively long interactions.

○ It includes current and historical data to provide a historical perspective of information.

○ Its usage is read-intensive.

○ It contains a few large tables.

Characteristics of Data Warehouse:


● Subject Oriented
● Integrated
● Time Variant
● Non-Volatile
Goals of Data Warehousing:

○ To help reporting as well as analysis

○ Maintain the organization's historical information

○ Be the foundation for decision making

Need for Data Warehouse:

1) Business User: Business users require a data warehouse to view summarized data
from the past. Since these people are non-technical, the data may be presented to them in
an elementary form.

2) Store historical data: A data warehouse is required to store time-variant data from the past. This historical data is used for various purposes.

3) Make strategic decisions: Some business strategies depend upon the data in the data warehouse, so the data warehouse contributes to making strategic decisions.

4) For data consistency and quality: By bringing the data from different sources to a common place, the organization can effectively enforce uniformity and consistency in the data.

5) Quick response time: The data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.
Components or Building Blocks of Data Warehouse:

Source Data Component :

Source data coming into the data warehouses may be grouped into four broad categories:

Production Data: This type of data comes from the different operational systems of the enterprise. Based on the data requirements in the data warehouse, we choose segments of the data from the various operational systems.

Internal Data: In each organization, the client keeps their "private" spreadsheets, reports,
customer profiles, and sometimes even department databases. This is the internal data, part of
which could be useful in a data warehouse.

Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.

External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics associated with their industry produced by external agencies.

Data Staging Component:

● After we have extracted data from various operational systems and external sources, we
have to prepare the files for storing in the data warehouse.
● The extracted data coming from several different sources need to be changed, converted,
and made ready in a format that is relevant to be saved for querying and analysis.

1) Data Extraction: This method has to deal with numerous data sources. We have to employ
the appropriate techniques for each data source.
2) Data Transformation:

● As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even more significant challenges.
● First, we clean the data extracted from each source. Cleaning may involve correcting misspellings and eliminating duplicates that arise when we bring in the same data from various source systems.
● Standardization of data components forms a large part of data transformation.
● On the other hand, data transformation also contains purging source data that is not useful and separating out source records into new combinations.

3) Data Loading:

● Two distinct categories of tasks form data loading functions.


● When we complete the structure and construction of the data warehouse and go live for the first time, we do the initial loading of the information into the data warehouse storage; after that, ongoing incremental loads apply the changes captured from the source systems (see the staging sketch below).
● The initial load moves high volumes of data and uses up a substantial amount of time.
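The staging flow above can be sketched in a few lines of Python. This is only an illustration: the CSV file name, its columns (customer_id, order_date, amount), and the target table are hypothetical, not part of any particular warehouse product.

import csv
import sqlite3

def extract(path):
    # Extraction: read raw records from a (hypothetical) operational export file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation: standardize formats and eliminate duplicates that arrive
    # when the same data is brought in from several source systems.
    seen, clean = set(), []
    for r in rows:
        key = (r["customer_id"], r["order_date"])
        if key in seen:
            continue                               # drop the duplicate record
        seen.add(key)
        r["order_date"] = r["order_date"].strip()  # simple standardization
        r["amount"] = float(r["amount"])           # unify the unit/format
        clean.append(r)
    return clean

def load(rows, db_path="warehouse.db"):
    # Initial load: move the prepared records into data warehouse storage.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, order_date TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                    [(r["customer_id"], r["order_date"], r["amount"]) for r in rows])
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders_export.csv")))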

Data Storage Components:

● Data storage for the data warehouse is a separate repository.

● The data repositories for the operational systems generally include only the current data.
● Also, these data repositories include the data structured in highly normalized form for fast and efficient processing.

Information Delivery Component:

The information delivery element is used to enable the process of subscribing for data
warehouse files and having it transferred to one or more destinations according to some
customer-specified scheduling algorithm.
Metadata Component:

● Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system.
● In the data dictionary, we keep the data about the logical data structures, the data about
the records and addresses, the information about the indexes, and so on.
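To make the idea of a data dictionary concrete, the sketch below models a single warehouse table as a small Python structure; the table, columns, source fields, and rules shown are purely illustrative.

# Minimal data-dictionary-style metadata for one (hypothetical) warehouse table.
metadata = {
    "sales": {
        "columns": {
            "order_date": {"type": "DATE",    "source": "orders.ord_dt", "rule": "parse as YYYY-MM-DD"},
            "amount":     {"type": "DECIMAL", "source": "orders.amt",    "rule": "convert cents to dollars"},
        },
        "indexes": ["order_date"],
        "refresh": "daily",
    }
}

def describe(table, column):
    # Metadata acting as a directory: where did this warehouse column come from?
    info = metadata[table]["columns"][column]
    return f"{table}.{column} <- {info['source']} ({info['rule']})"

print(describe("sales", "amount"))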

Data Marts:

● It includes a subset of corporate-wide data that is of value to a specific group of users.


● The scope is confined to particular selected subjects.
● Data marts are smaller than data warehouses and usually contain data belonging to a single department or business unit of the organization.

● The current trends in data warehousing are to develop a data warehouse with several
smaller related data marts for particular kinds of queries and reports.
Management and Control Component:

● The management and control elements coordinate the services and functions within the
data warehouse.
● These components control the data transformation and the data transfer into the data
warehouse storage.
● It monitors the movement of information into the staging method and from there into the
data warehouses storage itself.

Difference between Database and Data Warehouse:

1. Database: It is used for Online Transaction Processing (OLTP), but can also be used for other objectives such as data warehousing. It records the current data coming from the clients.
   Data Warehouse: It is used for Online Analytical Processing (OLAP). It reads the historical information about the customers for business decisions.

2. Database: The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space.
   Data Warehouse: The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries.

3. Database: Data is dynamic.
   Data Warehouse: Data is largely static.

4. Database: Entity-Relationship modeling procedures are used for the RDBMS database design.
   Data Warehouse: Data modeling approaches are used for the data warehouse design.

5. Database: Optimized for write operations.
   Data Warehouse: Optimized for read operations.

6. Database: Performance is low for analysis queries.
   Data Warehouse: High performance for analytical queries.

7. Database: The database is the place where the data is taken as a base and managed to provide fast and efficient access.
   Data Warehouse: The data warehouse is the place where the application data is handled for analysis and reporting objectives.

8. Database: Data In.
   Data Warehouse: Data Out.

9. Database: A small amount of data is accessed per query.
   Data Warehouse: A large amount of data is accessed per query.

10. Database: Relational databases are created for Online Transaction Processing (OLTP).
    Data Warehouse: The data warehouse is designed for Online Analytical Processing (OLAP).
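The "Data In vs Data Out" distinction can be illustrated with two queries against a toy sales table; the table and column names are made up for the example.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, order_date TEXT, amount REAL)")

# OLTP style ("Data In"): many small writes, one current record at a time.
con.execute("INSERT INTO sales VALUES (1, 'North', '2024-01-05', 99.0)")
con.execute("INSERT INTO sales VALUES (2, 'South', '2024-02-11', 42.5)")

# OLAP style ("Data Out"): a read-intensive query that scans history for analysis.
rows = con.execute(
    "SELECT region, strftime('%Y', order_date) AS yr, SUM(amount) "
    "FROM sales GROUP BY region, yr"
).fetchall()
print(rows)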
Data Warehouse Architecture:

● It is a method of defining the overall architecture of data communication, processing, and presentation within the enterprise.

● Data Warehouse applications are designed to support the users' ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP).

● Includes forecasting, profiling, summary reporting, and trend analysis.

Three common architectures are:

○ Data Warehouse Architecture: Basic

○ Data Warehouse Architecture: With Staging Area

○ Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic


➔ Operational System:

System that is used to process the day-to-day transactions of an organization.

➔ Flat Files

Flat files store transactional data, and every file in the system must have a different name.

➔ Meta Data

● Metadata is a set of data that defines and gives information about other data; it is used to direct a query to the most appropriate data source.
● For example, author, date built, date changed, and file size are examples of very basic document metadata.

Lightly and highly summarized data

● The goal of the summarized information is to speed up query performance.

● The summarized record is updated continuously as new information is loaded into the warehouse.

End-User access Tools

Customers interact with the warehouse using end-client access tools.

○ Reporting and Query Tools

○ Application Development Tools

○ Executive Information Systems Tools

○ Online Analytical Processing Tools

○ Data Mining Tools

Data Warehouse Architecture: With Staging Area

● We must clean and process the operational data before putting it into the warehouse.
● Data warehouses use a staging area (a place where data is processed before entering the warehouse).

● The Data Warehouse staging area is a temporary location where records from the source systems are copied.

● A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems.

Data Warehouse Architecture: With Staging Area and Data Marts

● A data mart is a segment of a data warehouse that can provide information for
reporting and analysis on a section, unit, department or operation in the company, e.g.,
sales, payroll, production, etc.
Properties of Data Warehouse Architectures:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.

2. Scalability: Hardware and software architectures should be easy to upgrade as the data volume grows.

3. Extensibility: Able to host new applications and technologies without redesigning the whole system.

4. Security: Monitoring accesses is necessary because of the strategic data stored in the data warehouse.

5. Administrability: Data warehouse management should not be complicated.

Three-Tier Data Warehouse Architecture

1. Bottom Tier (Data Warehouse Server)

2. Middle Tier (OLAP Server)

3. Top Tier (Front end Tools).


➔ Bottom Tier (Data Warehouse Server)
● A bottom-tier that consists of the Data Warehouse server, which is almost always an
RDBMS. It may include several specialized data marts and a metadata repository.
● Data from operational databases and external sources are extracted using application
program interfaces called a gateway.
● A gateway is provided by the underlying DBMS and allows customer programs to
generate SQL code to be executed at a server.
● Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
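As a rough illustration of how a client program talks to the bottom tier through a gateway, the sketch below uses the pyodbc library; the DSN name, credentials, and table are placeholders and assume an ODBC data source has already been configured.

import pyodbc  # third-party ODBC library: pip install pyodbc

# "WAREHOUSE_DSN" and the credentials are hypothetical placeholders.
conn = pyodbc.connect("DSN=WAREHOUSE_DSN;UID=report_user;PWD=secret")
cur = conn.cursor()

# The gateway lets the client program send SQL to be executed at the server.
cur.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)

conn.close()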

➔ Middle-tier

● A middle tier that consists of an OLAP server for fast querying of the data warehouse. The OLAP server is typically implemented using either:

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
functions on multidimensional data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a particular purpose server that directly
implements multidimensional information and operations.

➔ Top-tier

● A top-tier that contains front-end tools for displaying results provided by OLAP,
as well as additional tools for data mining of the OLAP-generated data.
Metadata repository (Contains Below Info)

1. DW structure, including the warehouse schema, dimensions, hierarchies, data mart locations, contents, etc.

2. Operational metadata, which usually describes the currency level of the stored data, plus monitoring information, i.e., usage statistics, error reports, audits, etc.

3. System performance data, which includes indices used to improve data access and retrieval performance.

4. Information about the mapping from operational databases, which includes the source RDBMSs and their contents, cleaning and transformation rules, etc.

5. Summarization algorithms, predefined queries and reports, and business data.


Principles of Data Warehousing :

● Load Performance
● Load Processing
● Data Quality Management
● Query Performance
● Terabyte Scalability

● The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
● The main advantage of the reconciled layer is that it creates a standard reference data
model for a whole enterprise. At the same time, it separates the problems of source data
extraction and integration from those of data warehouse population
Autonomous Data Warehouse :

● Autonomous Data Warehouse is a fully managed database tuned and optimized for
data warehouse workloads with the market-leading performance of Oracle
Database.

● Autonomous Data Warehouse continuously monitors all aspects of system


performance. It adjusts autonomously to ensure consistent high performance
even as workloads, query types, and the number of users vary over time.

● Oracle Cloud provides a set of data management services built on self-driving


Oracle Autonomous Database technology to deliver automated patching, upgrades,
and tuning, including performing all routine database maintenance tasks while the
system is running, without human intervention.

Uses or Benefits of Autonomous DW:

● EASY:
1. Fully autonomous database
2. Automated provisioning, patching and upgrades
3. Automated backups
4. Automated performance tuning

● FAST:
1. Built on Exadata: high performance, scalability and reliability
2. Built on key Oracle Database capabilities: parallelism, columnar processing, compression

● ELASTIC:
1. Elastic scaling of compute and storage, without downtime
2. Pay only for resources consumed
Snowflake DataWarehouse :

Snowflake is a cloud-based data warehouse that runs on Amazon Web Services or Microsoft Azure. It's great for enterprises that don't want to devote resources to the setup, maintenance, and support of in-house servers because there's no hardware or software to choose, install, configure, or manage.

Snowflake's data architecture has three main layers:


● Database Storage
● Query Processing
● Cloud Services
At a functional level, to access data from Snowflake, the following components are required (see the connection sketch below):
-> The proper role, chosen after logging in
-> A Virtual Warehouse (known simply as a Warehouse in Snowflake) to perform any activity
-> Database Schema
-> Database
-> Tables and columns
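A minimal connection sketch using the snowflake-connector-python package is shown below; every identifier (account, user, role, warehouse, database, schema, table) is a placeholder that would be replaced with values from an existing Snowflake account.

import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="my_account",      # Snowflake account identifier (placeholder)
    user="analyst",
    password="********",
    role="ANALYST_ROLE",       # proper role chosen after logging in
    warehouse="REPORTING_WH",  # virtual warehouse that performs the activity
    database="SALES_DB",
    schema="PUBLIC",
)
cur = conn.cursor()
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
print(cur.fetchall())
conn.close()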
Snowflake provides the following high-level analytics functionalities:
-> Data Transformation
-> Supports for Business Application
-> Business Analytics/Reporting/BI
-> Data Science
-> Data Sharing to other data systems
-> Data Cloning
Main Difference:
CCS341 - DATA WAREHOUSING
Unit -II
ETL AND OLAP TECHNOLOGY

Syllabus:

What is ETL – ETL Vs ELT – Types of Data warehouses - Data warehouse Design
and Modeling - Delivery Process - Online Analytical Processing (OLAP) -
Characteristics of OLAP - Online Transaction Processing (OLTP) Vs OLAP -
OLAP operations- Types of OLAP- ROLAP Vs MOLAP Vs HOLAP

What is ETL ?

● The mechanism of extracting information from source systems and bringing it into the
data warehouse is commonly called ETL, which stands for Extraction, Transformation
and Loading.

● The ETL process requires active inputs from various stakeholders, including
developers, analysts, testers.

● ETL is a recurring method (daily, weekly, monthly) of a Data warehouse system and
needs to be agile, automated, and well documented
ETL Working Mechanism:

Extraction:

● This is the first stage of the ETL process. Extraction is the operation of
extracting information from a source system for further use in a data
warehouse environment.

● Extraction process is often one of the most time-consuming tasks in the ETL.

● The data has to be extracted several times in a periodic manner to supply all
changed data to the warehouse and keep it up-to-date.

Cleansing:
● The cleansing stage is crucial in a data warehouse system because it is supposed to improve data quality.

● The primary data cleansing features found in ETL tools are rectification and
homogenization.

● They use specific dictionaries to rectify typing mistakes and to recognize synonyms, as
well as rule-based cleansing to enforce domain-specific rules and define appropriate
associations between values.
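A toy version of rectification and homogenization is sketched below; the typo dictionary, the synonym dictionary, and the domain rule are invented for illustration.

# Illustrative cleansing dictionaries (rectification and homogenization).
TYPO_FIXES = {"Calfornia": "California", "Nwe York": "New York"}
SYNONYMS = {"CA": "California", "NY": "New York"}

def cleanse(record):
    state = record["state"].strip()
    state = TYPO_FIXES.get(state, state)   # rectification of typing mistakes
    state = SYNONYMS.get(state, state)     # homogenization of synonyms
    record["state"] = state
    # Rule-based cleansing: enforce a simple domain-specific rule.
    if record["amount"] < 0:
        raise ValueError(f"negative amount not allowed: {record}")
    return record

print(cleanse({"state": "NY", "amount": 10.0}))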
Transformation:

● Transformation is the core of the reconciliation phase. It converts records from their operational source format into a particular data warehouse format.

The following points must be rectified in this phase:

○ Loose text may hide valuable information. For example, "XYZ PVT Ltd" does not explicitly show that this is a private limited company.

○ Different formats can be used for individual data items. For example, a date can be saved as a string or as three integers (day, month, year).

Following are the main transformation processes aimed at populating the reconciled data layer:

○ Conversion and normalization that operate on both storage formats and units of
measure to make data uniform.

○ Matching that associates equivalent fields in different sources.

○ Selection that reduces the number of source fields and records.
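The three reconciliation steps listed above can be sketched on two invented source records, one from a CRM-style system and one from an ERP-style system; field names and units are hypothetical.

# Illustrative records from two hypothetical source systems.
crm_rows = [{"cust": "C1", "revenue_usd": "1200.50", "notes": "call back"}]
erp_rows = [{"customer_id": "C1", "sales_cents": 120050, "region": "North"}]

def convert_crm(r):
    # Conversion and normalization: one field name, format, and unit of measure.
    return {"customer_id": r["cust"], "revenue": float(r["revenue_usd"])}

def convert_erp(r):
    return {"customer_id": r["customer_id"], "revenue": r["sales_cents"] / 100,
            "region": r["region"]}

def match(a_rows, b_rows):
    # Matching: associate equivalent fields from different sources by key.
    b_by_key = {r["customer_id"]: r for r in b_rows}
    return [{**a, **b_by_key.get(a["customer_id"], {})} for a in a_rows]

def select(rows, fields=("customer_id", "revenue", "region")):
    # Selection: keep only the source fields needed in the reconciled layer.
    return [{f: r[f] for f in fields if f in r} for r in rows]

reconciled = select(match([convert_crm(r) for r in crm_rows],
                          [convert_erp(r) for r in erp_rows]))
print(reconciled)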

Loading:

● The Load is the process of writing the data into the target database.

● During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.

Loading can be carried in two ways:

1. Refresh: Data Warehouse data is completely rewritten. This means that the older data is replaced. Refresh is usually used in combination with static extraction to populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying preexisting
data. This method is used in combination with incremental extraction to update data
warehouses regularly.
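A minimal sketch of the two loading modes, using an in-memory SQLite table as a stand-in for warehouse storage; the table and rows are invented for the example.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dw_sales (order_id INTEGER PRIMARY KEY, amount REAL)")

def refresh(rows):
    # Refresh: the warehouse data is completely rewritten (used with static
    # extraction, typically for the initial population).
    con.execute("DELETE FROM dw_sales")
    con.executemany("INSERT INTO dw_sales VALUES (?, ?)", rows)

def update(changed_rows):
    # Update: only the changes from incremental extraction are applied,
    # without deleting or modifying the pre-existing data.
    con.executemany("INSERT OR IGNORE INTO dw_sales VALUES (?, ?)", changed_rows)

refresh([(1, 10.0), (2, 25.0)])   # initial population
update([(3, 7.5)])                # a new record captured since the last load
print(con.execute("SELECT * FROM dw_sales ORDER BY order_id").fetchall())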
ETL Vs ELT

Basis: Process
ETL: Data is transferred to the ETL server and moved back to the database. High network bandwidth is required.
ELT: Data remains in the database, except for cross-database loads (e.g., from a source object).

Basis: Transformation
ETL: Transformations are performed in the ETL server.
ELT: Transformations are performed in the target (or in the source).

Basis: Code Usage
ETL: Typically used for source-to-target transfer, compute-intensive transformations, and small amounts of data.
ELT: Typically used for high amounts of data.

Basis: Time / Maintenance
ETL: Needs high maintenance, as you need to select the data to load and transform.
ELT: Low maintenance, as the data is always available.

Basis: Calculations
ETL: Overwrites the existing column, or the dataset must be appended and pushed to the target platform.
ELT: Easily adds the calculated column to the existing table.

Types of Data warehouses :

1. Host Based Data Warehouse


2. LAN Based Work Group
3. Single Stage Data Warehouse
4. Multi Stage Data Warehouse
5. Stationary Data Warehouse
6. Distributed Data Warehouse
7. Virtual Data Warehouse
1. Host-Based Data Warehouses:

There are two types of host-based data warehouses which can be implemented:

● Host-Based mainframe warehouses, which reside on a high-volume database and are supported by robust and reliable high-capacity structures.
● Host-Based LAN data warehouses, where data delivery can be handled either centrally or from the workgroup environment.

Data Extraction and transformation tools allow the automated extraction and cleaning of data
from production systems.

To make such data warehouses building successful, the following phases are generally followed:

1. Unload Phase: It contains selecting and scrubbing the operational data.


2. Transform Phase: For translating it into an appropriate form and describing the rules for
accessing and storing it.

3. Load Phase: For moving the record directly into DB2 tables or a particular file for
moving it into another database or non-MVS warehouse.

LAN-Based Workgroup Data Warehouses:


● A LAN based workgroup warehouse is an integrated structure for building and
maintaining a data warehouse in a LAN environment.

● In this warehouse, we can extract information from a variety of sources and support
multiple LAN based warehouses

● A LAN based workgroup warehouse ensures the delivery of information from


corporate resources by providing transport access to the data in the warehouse.
Single Stage (LAN) Data Warehouses:
● A single store frequently drives a LAN based warehouse and provides existing DSS
applications, enabling the business user to locate data in their data warehouse.

● The LAN based warehouse can support business users with complete data to information
solutions.

● This type of warehouse can include business views, histories, aggregation, versioning, and heterogeneous source support, such as

a) DB2 Family

b) IMS, VSAM, Flat File [MVS and VM]

Multi-Stage Data Warehouses:


● It refers to multiple stages in transforming methods for analyzing data through
aggregations.
● In other words, staging of the data multiple times before the loading operation into the
data warehouse
● Data gets extracted from source systems to staging area first, then gets loaded to data
warehouse after the change and then finally to departmentalized data marts.

Stationary Data Warehouses:

In this type of data warehouse, the data is not moved from the sources; users are given direct access to the source data.

This approach does generate several problems for the customer, such as

● Identifying the location of the information for the users

● Providing clients the ability to query different DBMSs as if they were all a single DBMS with a single API.
● Impacting performance, since the customer will be competing with the production data stores

Distributed Data Warehouses:

The concept of a distributed data warehouse suggests that there are two types of distributed data warehouses and their modifications: local enterprise warehouses, which are distributed throughout the enterprise, and a global warehouse.
Characteristics of Local(Distributed) data warehouses:

● Bulk of the operational processing


● Local site is autonomous
● Each local data warehouse has its unique architecture and contents of data
● The data is unique and of prime essential to that locality only
● Majority of the record is local and not replicated

Virtual Data Warehouses:

Virtual Data Warehouses is created in the following stages:

1. Installing a set of data access, data dictionary, and process management facilities.

2. Training end-clients.

3. Monitoring how DW facilities will be used

4. Based upon actual usage, physically Data Warehouse is created to provide the
high-frequency results

Disadvantages:

1. Since queries compete with production record transactions, performance can be degraded.

2. There is no metadata, no summary record, and no individual DSS (Decision Support System) integration or history. All queries must be copied, causing an additional burden on the system.

3. There is no refreshing process, causing the queries to be very complex


UNIT - 3

META DATA, DATA MART AND PARTITION STRATEGY

Meta Data – Categories of Metadata – Role of Metadata – Metadata Repository – Challenges for Metadata Management – Data Mart – Need of Data Mart – Cost Effective Data Mart – Designing Data Marts – Cost of Data Marts – Partitioning Strategy – Vertical Partition – Normalization – Row Splitting – Horizontal Partition

Metadata Definition:

● Metadata is simply defined as data about data.


● The data that is used to represent other data is known as metadata.
● Metadata is the road-map to a data warehouse.
● Metadata in a data warehouse defines the warehouse objects.
● Metadata acts as a directory.

Example: The index of a book serves as metadata for the contents of the book.

Types of metadata:

● Operational metadata
● Extraction and transformation metadata
● End-user metadata

→ Operational Metadata:
● As you know, data for the data warehouse comes from several operational
systems of the enterprise.
● These source systems contain different data structures.
● The data elements selected for the data warehouse have various field lengths
and data types.
→ Extraction and transformation metadata:
● Extraction and transformation metadata contain data about the extraction of data from the source systems, namely, the extraction frequencies, extraction methods, and business rules for the data extraction.
● Also, this category of metadata contains information about all the data transformations that take place in the data staging area.

→ End-user metadata:
● The end-user metadata is the navigational map of the data warehouse.
● It enables the end-users to find information from the data warehouse.
● The end-user metadata allows the end-users to use their own business
terminology and look for information in those ways in which they normally
think of the business.

Categories of Metadata:
a) Business Metadata:

It has the data ownership information, business definition, and changing policies.

Examples:

● Data ownership
● Query and reporting tools
● Predefined queries
● Predefined reports
● Report distribution information
● Common information access routes
● Rules for analysis using OLAP
● Currency of OLAP data
● Data warehouse refresh schedule

b) Technical Metadata:

1. It includes database system names, table and column names and sizes, data types and allowed values.
2. Technical metadata also includes structural information, such as primary and foreign key attributes and indices.

Examples:

● Data models of source systems


● Record layouts of outside sources
● Data aggregation rules
● Data cleansing rules
● Summarization and derivations
● Data loading and refresh schedules and controls
● Job dependencies

c) Operational Metadata:

1. It includes currency of data and data lineage.
2. Currency of data means whether the data is active, archived, or purged.
3. Lineage of data means the history of data migrated and the transformations applied on it.

Role of Metadata:

Various roles of metadata are explained below.

● Metadata acts as a directory.


● This directory helps the decision support system to locate the contents of the data warehouse.
● Metadata helps in decision support systems for mapping of data when data is transformed from
operational environment to data warehouse environment.
● Metadata helps in summarization between current detailed data and highly summarized data.
● Metadata also helps in summarization between lightly detailed data and highly summarized data.
● Metadata is used for query tools.
● Metadata is used in extraction and cleansing tools.
● Metadata is used in reporting tools.
● Metadata is used in transformation tools.
● Metadata plays an important role in loading functions
Metadata Repository:

A metadata repository is an integral part of a data warehouse system. It contains the following metadata:

● Definition of the data warehouse:

1. It includes the description of the structure of the data warehouse.

2. The description is defined by schema, views, hierarchies, derived data definitions, and data mart locations and contents.

● Business metadata:

It contains the data ownership information, business definitions, and changing policies.

● Operational metadata:

1. It includes currency of data and data lineage.

2. Currency of data means whether the data is active, archived, or purged.

3. Lineage of data means the history of data migrated and the transformations applied on it.

● Data for mapping from the operational environment to the data warehouse:

It includes the source databases and their contents, data extraction, data partitioning, cleaning and transformation rules, and data refresh and purging rules.

● Algorithms for summarization:

It includes dimension algorithms, data on granularity, aggregation, summarizing, etc.
➢Challenges for Metadata Management:
The importance of metadata can not be overstated. Metadata helps in driving the accuracy of
reports, validates data transformation, and ensures the accuracy of calculations

● Metadata in a big organization is scattered across the organization. This metadata is spread in
spreadsheets, databases, and applications.

● Metadata could be present in text files or multimedia files. To use this data for information
management solutions, it has to be correctly defined.

● There are no industry-wide accepted standards. Data management solution vendors have a narrow focus.

● There are no easy and accepted methods of passing metadata between tools.

DATA MART:

● A Data Mart is a subset of an organization's information store, generally oriented to a specific purpose or major data subject, which may be distributed to support business needs.

● The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to gather, store, access, and analyze data.

Data Warehouse vs Data Mart:


Need of Data Mart:

● To partition data in order to impose access control strategies.

● To speed up the queries by reducing the volume of data to be scanned.

● To segment data into different hardware platforms.

● To structure data in a form suitable for a user access tool.

➢ Types of Data Mart:

There are mainly two approaches to designing data marts. These approaches are

● Dependent Data Marts

● Independent Data Marts

Dependent Data Mart:

In this approach, the data warehouse is built first and the data marts are then created from it; therefore, there is no need for data mart integration. It is also known as a top-down approach.
Independent Data Mart:

It is also termed as a bottom-up approach as the data marts are integrated to develop a data warehouse.
Cost-effective Data Marting

Follow the steps given below to make data marting cost-effective −

● Identify the Functional Splits


● Identify User Access Tool Requirements
● Identify Access Control Issues

Identify the Functional Splits


➢ In this step, we determine if the organization has natural functional splits.
➢ We look for departmental splits, and we determine whether the way in which departments use
information tend to be in isolation from the rest of the organization.
➢ Let's have an example.
➢ Consider a retail organization, where each merchant is accountable for maximizing the sales of a
group of products. For this, the following are the valuable information −
● sales transaction on a daily basis
● sales forecast on a weekly basis
● stock position on a daily basis
● stock movements on a daily basis

➢ As the merchant is not interested in the products they are not dealing with, the data mart is a subset of the data dealing with the product group of interest. The following diagram shows data marting for different users.

Identify User Access Tool Requirements


➢ We need data marts to support user access tools that require internal data structures.
➢ The data in such structures are outside the control of data warehouse but need to be populated and
updated on a regular basis.
➢ There are some tools that populate directly from the source system but some cannot.
➢ Therefore, additional requirements outside the scope of the tool need to be identified for the future.
Identify Access Control Issues
➢ There should be privacy rules to ensure the data is accessed by authorized users only.
➢ For example a data warehouse for a retail banking institution ensures that all the accounts belong
to the same legal entity.
➢ Privacy laws can force you to totally prevent access to information that is not owned by the
specific bank.
➢ Data marts allow us to build a complete wall by physically separating data segments within the
data warehouse.
➢ To avoid possible privacy problems, the detailed data can be removed from the data warehouse.
➢ We can create a data mart for each legal entity and load it via a data warehouse, with detailed
account data.
Designing Data Marts:

Steps in Implementing a Data Mart:


The significant steps in implementing a data mart are to design the schema, construct the
physical storage, populate the data mart with data from source systems, access it to make
informed decisions and manage it over time. So, the steps are:

Designing

The design step is the first in the data mart process. This phase covers all of the functions from
initiating the request for a data mart through gathering data about the requirements and
developing the logical and physical design of the data mart.

It involves the following tasks:

1. Gathering the business and technical requirements


2. Identifying data sources

3. Selecting the appropriate subset of data

4. Designing the logical and physical architecture of the data mart.

Constructing

This step contains creating the physical database and logical structures associated with the data
mart to provide fast and efficient access to the data.

It involves the following tasks:

1. Creating the physical database and logical structures such as tablespaces associated with
the data mart.

2. Creating the schema objects, such as tables and indexes, described in the design step.

3. Determining how best to set up the tables and access structures.

Populating

This step includes all of the tasks related to getting data from the source, cleaning it up, modifying it to the right format and level of detail, and moving it into the data mart.

It involves the following tasks:

1. Mapping data sources to target data sources

2. Extracting data

3. Cleansing and transforming the information.

4. Loading data into the data mart

5. Creating and storing metadata

Accessing

This step involves putting the data to use: querying the data, analyzing it, creating reports, charts
and graphs and publishing them.
It involves the following tasks:

1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database operations and object names into business terms so that the end-clients can interact with the data mart using words that relate to the business functions.

2. Set up and manage database structures, like summarized tables, which help queries submitted through the front-end tools execute rapidly and efficiently.

Managing

This step contains managing the data mart over its lifetime. In this step, management functions
are performed as:

1. Providing secure access to the data.

2. Managing the growth of the data.

3. Optimizing the system for better performance.

4. Ensuring the availability of data even with system failures.

Cost of Data Marting

The cost measures for data marting are as follows −

● Hardware and Software Cost


● Network Access
● Time Window Constraints

Hardware and Software Cost


➢ Although data marts are created on the same hardware, they require some additional hardware
and software.
➢ To handle user queries, it requires additional processing power and disk storage. If detailed data
and the data mart exist within the data warehouse, then we would face additional cost to store and
manage replicated data.

Network Access
A data mart could be on a different location from the data warehouse, so we should ensure that the LAN
or WAN has the capacity to handle the data volumes being transferred within the data mart load process.

Time Window Constraints


The extent to which a data mart loading process will eat into the available time window depends on the
complexity of the transformations and the data volumes being shipped. The determination of how many
data marts are possible depends on −

★ Network capacity.
★ Time window available
★ Volume of data being transferred
★ Mechanisms being used to insert data into a data mart

Partitioning Strategy :

● Partitioning is done to enhance performance and facilitate easy


management of data.
● Partitioning also helps in balancing the various requirements of the
system.
● It optimizes the hardware performance and simplifies the management
of data warehouse by partitioning each fact table into multiple separate
partitions.

Why is it Necessary to Partition?


● For easy management,
● To assist backup/recovery,
● To enhance performance.
For Easy Management
The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge size of fact
table is very hard to manage as a single entity. Therefore it needs partitioning.

To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the data.
Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to
load and also enhances the performance of the system.

To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query performance
is enhanced because now the query scans only those partitions that are relevant. It does not have to scan
the whole data.

Horizontal Partitioning:
➢ There are various ways in which a fact table can be partitioned.
➢ In horizontal partitioning, we have to keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments


➢ In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each
time period represents a significant retention period within the business.
➢ For example, if the user queries for month to date data then it is appropriate to partition the data
into monthly segments.
➢ We can reuse the partitioned tables by removing the data in them.
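A small sketch of partitioning by time into monthly segments; the fact rows and the table-naming scheme (one SQLite table per month) are illustrative only.

import sqlite3
from itertools import groupby

con = sqlite3.connect(":memory:")
fact_rows = [("2024-01-05", 10.0), ("2024-01-20", 4.0), ("2024-02-02", 7.5)]

# One physical table per monthly segment, e.g. sales_2024_01, sales_2024_02.
for month, rows in groupby(sorted(fact_rows), key=lambda r: r[0][:7]):
    table = "sales_" + month.replace("-", "_")
    con.execute(f"CREATE TABLE {table} (sale_date TEXT, amount REAL)")
    con.executemany(f"INSERT INTO {table} VALUES (?, ?)", list(rows))

# A month-to-date query only has to scan the single relevant partition.
print(con.execute("SELECT SUM(amount) FROM sales_2024_02").fetchall())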

Partition by Time into Different-sized Segments


This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and a larger partition for inactive data.
Partition on a Different Dimension:
● The fact table can also be partitioned on the basis of dimensions other than time such as product
group, region, supplier, or any other dimension. Let's have an example.
● Suppose a market function has been structured into distinct regional departments like on a state
by state basis.
● If each region wants to query on information captured within its region, it would prove to be
more effective to partition the fact table into regional partitions.
● This will cause the queries to speed up because it does not require to scan information that is not
relevant.

Partition by Size of Table


➢ When there is no clear basis for partitioning the fact table on any dimension, then we should partition the fact table on the basis of its size.
➢ We can set the predetermined size as a critical point.
➢ When the table exceeds the predetermined size, a new table partition is created.

Partitioning Dimensions:

➢ If a dimension contains a large number of entries, then it is required to partition the dimension.
➢ Here we have to check the size of a dimension.
➢ Consider a large design that changes over time.
➢ If we need to store all the variations in order to apply comparisons, that
dimension may be very large. This would definitely affect the response
time.
Round Robin Partitions:

➢ In the round robin technique, when a new partition is needed, the old
one is archived.
➢ It uses metadata to allow user access tool to refer to the correct table
partition.
➢ This technique makes it easy to automate table management facilities
within the data warehouse.

Vertical Partition:
Vertical partitioning splits the data vertically. The following image depicts how vertical partitioning is done.

Vertical partitioning can be performed in the following two ways −

● Normalization
● Row Splitting
Normalization

❖ Normalization is the standard relational method of database organization.
❖ In this method, the rows are collapsed into a single row, hence it reduces space.
❖ Take a look at the following tables that show how normalization is performed.

Table before and After Normalization:


Row Splitting:

Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.
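The sketch below shows row splitting on an invented customer table: frequently used columns stay in a core table, rarely used ones move to a companion table, with a one-to-one mapping on the shared key.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer_core  (cust_id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
con.execute("CREATE TABLE customer_extra (cust_id INTEGER PRIMARY KEY, notes TEXT, signup_channel TEXT)")

con.execute("INSERT INTO customer_core  VALUES (1, 'Asha', 'North')")
con.execute("INSERT INTO customer_extra VALUES (1, 'prefers email', 'web')")

# Most queries touch only the smaller core table and therefore scan less data;
# the full row is reconstructed with a join on the shared key when needed.
print(con.execute(
    "SELECT c.name, e.notes FROM customer_core c "
    "JOIN customer_extra e USING (cust_id)").fetchall())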

Identify Key to Partition


● It is very crucial to choose the right partition key.
● Choosing a wrong partition key will lead to reorganizing the fact table.
● Let's have an example. Suppose we want to partition the following
table.

● If we partition by transaction_date instead of region, then the latest transactions from every region will be in one partition.
● Now the user who wants to look at data within his own region has to query across multiple partitions.
● Hence it is worth determining the right partitioning key.
Data Warehouse Tools
The tools that allow the sourcing of data contents and formats accurately from operational and external data stores into the data warehouse have to perform several essential tasks, which include:

Data consolidation and integration.

Data transformation from one form to another form.

Data transformation and calculation based on the application of business rules that drive the transformation.

Metadata synchronization and management, which includes storing or updating metadata about source files, transformation actions, loading formats, and events.

There are several selection criteria which should be considered while implementing a data warehouse:

1. The ability to identify the data in the data source environment that can be read by the tool is necessary.

2. Support for flat files, indexed files, and legacy DBMSs is critical.

3. The capability to merge records from multiple data stores is required in many installations.

4. The specification interface to indicate the information to be extracted and converted is essential.

5. The ability to read information from repository products or data dictionaries is desired.

6. The code developed by the tool should be completely maintainable.

7. Selective data extraction of both data items and records enables users to extract only the required data.

8. A field-level data examination for the transformation of data into information is needed.

9. The ability to perform data type and character-set translation is a requirement when moving data between incompatible systems.

10. The ability to create aggregation, summarization and derivation fields and records is necessary.

11. Vendor stability and support for the products are components that must be evaluated carefully.
Data Warehouse Software Components

A warehousing team will require different types of tools during a warehouse project. These
software products usually fall into one or more of the categories illustrated, as shown in the figure.
[Figure: Data Warehouse Software Components - Extraction & Transformation, Warehouse Technology, and Data Access & Retrieval tools (report writers, EIS/DSS, data mining, alert systems and exception reporting).]

Extraction and Transformation

The warehouse team needs tools that can extract, transform, integrate, clean, and load
information from a source system into one or more data warehouse databases. Middleware and
gateway products may be needed for warehouses that extract a record from a host-based source
system.

Warehouse Storage

Software products are also needed to store warehouse data and their accompanying metadata.
Relational database management systems are well suited to large and growing warehouses.

Data access and retrieval

Different types of software are needed to access, retrieve, distribute, and present warehouse data
to its end-clients.
Types of Database Parallelism
Parallelism is used to support speedup, where queries are executed faster because more resources, such as processors and disks, are provided. Parallelism is also used to provide scale-up, where increasing workloads are managed without increased response time, via an increase in the degree of parallelism.

Different architectures for parallel database systems are shared-memory, shared-disk, shared-nothing, and hierarchical structures.

(a)Horizontal Parallelism: It means that the database is partitioned across multiple disks, and
parallel processing occurs within a specific task (i.e., table scan) that is performed concurrently on
different processors against different sets of data.

(b) Vertical Parallelism: It occurs among various tasks. All component query operations (i.e., scan, join, and sort) are executed in parallel in a pipelined fashion. In other words, an output from one function (e.g., join) is passed to the next function as soon as records become available.

[Figure: Response time comparison - serial RDBMS vs. horizontal parallelism (data partitioning), vertical parallelism (query partitioning), and a parallel RDBMS (query decomposition plus data partitioning).]

Intraquery Parallelism

Intraquery parallelism defines the execution of a single query in parallel on multiple processors
and disks. Using intraquery parallelism is essential for speeding up long-running queries.

Interquery parallelism does not help in this function since each query is run sequentially.
To improve the situation, many DBMS vendors developed versions of their products that utilized
intraquery parallelism.

This application of parallelism decomposes the serial SQL query into lower-level operations such as scan, join, sort, and aggregation.

These lower-level operations are executed concurrently, in parallel.
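As a rough illustration of query decomposition, the sketch below runs the scan-and-aggregate step of a single query concurrently over toy partitions with Python's multiprocessing module, then merges the partial results; the data and partitioning are invented.

from multiprocessing import Pool

# Toy horizontal partitions of a fact table (one list per disk/processor).
PARTITIONS = [
    [("North", 10.0), ("South", 4.0)],
    [("North", 2.5), ("East", 7.0)],
]

def scan_and_aggregate(partition):
    # Lower-level operation (scan plus partial aggregation) on one partition.
    totals = {}
    for region, amount in partition:
        totals[region] = totals.get(region, 0.0) + amount
    return totals

if __name__ == "__main__":
    with Pool(processes=len(PARTITIONS)) as pool:
        partials = pool.map(scan_and_aggregate, PARTITIONS)  # run concurrently
    # Final step of the decomposed query: merge the partial results.
    result = {}
    for part in partials:
        for region, total in part.items():
            result[region] = result.get(region, 0.0) + total
    print(result)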

Interquery Parallelism

In interquery parallelism, different queries or transactions execute in parallel with one another.

This form of parallelism can increase transaction throughput. The response times of individual transactions are not faster than they would be if the transactions were run in isolation.

Thus, the primary use of interquery parallelism is to scale up a transaction processing system to
support a more significant number of transactions per second.

Database vendors started to take advantage of parallel hardware architectures by implementing multiserver and multithreaded systems designed to handle a large number of client requests efficiently.

This approach naturally resulted in interquery parallelism, in which different server threads (or
processes) handle multiple requests at the same time.

Interquery parallelism has been successfully implemented on SMP systems, where it increased the
throughput and allowed the support of more concurrent users.

Shared Disk Architecture

Shared-disk architecture implements a concept of shared ownership of the entire database between RDBMS servers, each of which is running on a node of a distributed memory system.

Each RDBMS server can read, write, update, and delete information from the same shared
database, which would need the system to implement a form of a distributed lock manager (DLM).

DLM components can be found in hardware, the operating system, and separate software layer, all
depending on the system vendor.
On the positive side, shared-disk architectures can reduce performance bottlenecks resulting from
data skew (uneven distribution of data), and can significantly increase system availability.

The shared-disk distributed memory design eliminates the memory access bottleneck typically of
large SMP systems and helps reduce DBMS dependency on data partitioning.

[Figure: Distributed memory shared-disk architecture - nodes with local memory connected through an interconnection network to a global shared disk subsystem.]

Shared-Memory Architecture

Shared-memory or shared-everything style is the traditional approach of implementing an RDBMS on SMP hardware.

It is relatively simple to implement and has been very successful up to the point where it runs into
the scalability limitations of the shared-everything architecture.

The key point of this technique is that a single RDBMS server can probably apply all processors,
access all memory, and access the entire database, thus providing the client with a consistent
single system image.
[Figure: Shared-memory architecture - processors connected through an interconnection network to a global shared memory.]

In shared-memory SMP systems, the DBMS considers that the multiple database components
executing SQL statements communicate with each other by exchanging messages and information
via the shared memory.

All processors have access to all data, which is partitioned across local disks.

Shared-Nothing Architecture

In a shared-nothing distributed memory environment, the data is partitioned across all disks, and
the DBMS is "partitioned" across multiple co-servers, each of which resides on individual nodes of
the parallel system and has an ownership of its disk and thus its database partition.

A shared-nothing RDBMS parallelizes the execution of a SQL query across multiple processing
nodes.

Each processor has its memory and disk and communicates with other processors by exchanging
messages and data over the interconnection network.

This architecture is optimized specifically for the MPP and cluster systems.

The shared-nothing architectures offer near-linear scalability. The number of processor nodes is
limited only by the hardware platform limitations (and budgetary constraints), and each node itself
can be a powerful SMP system.
[Figure: Shared-nothing architecture - each node has its own processor, memory, and disk, and nodes communicate over an interconnection network.]
Data Warehouse Process Architecture
The process architecture defines an architecture in which the data from the data warehouse is
processed for a particular computation.

Following are the two fundamental process architectures:

● Centralized Process Architecture
● Distributed Process Architecture

Centralized Process Architecture

In this architecture, the data is collected into single centralized storage and processed upon
completion by a single machine with a huge structure in terms of memory, processor, and storage.

Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service.

It requires minimal resources both from people and system perspectives.

It is very successful when the collection and consumption of data occur at the same location.
[Figure: Centralized Process Architecture - all data is collected into a central data warehouse for processing.]

Distributed Process Architecture

In this architecture, information and its processing are allocated across data centers; the processing of data is localized, and the results are grouped into centralized storage. Distributed architectures are used to overcome the limitations of the centralized process architecture, where all the information needs to be collected in one central location and the results are available in one central location.

There are several architectures of the distributed process:

Client-Server

In this architecture, the user does all the information collecting and presentation, while the server
does the processing and management of data.

Three-tier Architecture

With client-server architecture, the client machines need to be connected to a server machine, thus mandating finite states and introducing latencies and overhead in terms of records to be carried between clients and servers.

N-tier Architecture
The n-tier or multi-tier architecture is where clients, middleware, applications, and servers are
isolated into tiers.

Cluster Architecture

In this architecture, machines that are connected in network architecture (software or hardware) to
approximately work together to process information or compute requirements in parallel. Each
device in a cluster is associated with a function that is processed locally, and the result sets are
collected to a master server that returns it to the user.

Peer-to-Peer Architecture

This is a type of architecture where there are no dedicated servers and clients. Instead, all the
processing responsibilities are allocated among all machines, called peers. Each machine can
perform the function of a client or server or just process data.
What is Fact Constellation Schema?
A Fact constellation means two or more fact tables sharing one or more dimensions. It is also
called Galaxy schema.

Fact Constellation Schema describes a logical structure of a data warehouse or data mart. A Fact Constellation Schema can be designed with a collection of de-normalized fact tables and shared, conformed dimension tables.

[Figure: Fact constellation schema - two fact tables (Business Results and Business Forecast) sharing the Product dimension table.]

A fact constellation schema is a sophisticated database design in which it is difficult to summarize information. A fact constellation schema can be implemented between aggregate fact tables, or by decomposing a complex fact table into independent simple fact tables.

Example: A fact constellation schema is shown in the figure below.


[Figure: Example fact constellation schema - a Sales fact table and a Shipping fact table sharing the Time, Item, and Location dimension tables, with the Sales fact table also linked to the Branch dimension table and the Shipping fact table also linked to the Shipper dimension table.]

This schema defines two fact tables, sales, and shipping. Sales are treated along four dimensions,
namely, time, item, branch, and location. The schema contains a fact table for sales that includes
keys to each of the four dimensions, along with two measures: Rupee_sold and units_sold. The
shipping table has five dimensions, or keys: item_key, time_key, shipper_key, from_location, and
to_location, and two measures: Rupee_cost and units_shipped.
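A sketch of this constellation as SQLite DDL is shown below; the column lists follow the description above, while data types and any extra columns are assumptions made for the example.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Shared (conformed) dimension tables used by both fact tables.
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE shipper_dim  (shipper_key INTEGER PRIMARY KEY, shipper_name TEXT);

-- Sales fact table: keys to four dimensions plus two measures.
CREATE TABLE sales_fact (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER, location_key INTEGER,
    rupee_sold REAL, units_sold INTEGER
);

-- Shipping fact table: five keys (two location roles) plus two measures.
CREATE TABLE shipping_fact (
    item_key INTEGER, time_key INTEGER, shipper_key INTEGER,
    from_location INTEGER, to_location INTEGER,
    rupee_cost REAL, units_shipped INTEGER
);
""")
print("fact constellation created")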

The primary disadvantage of the fact constellation schema is that it is a more challenging design
because many variants for specific kinds of aggregation must be considered and selected.

Data Warehouse Applications

The application areas of the data warehouse are:


● Information Processing
● Analytical Processing
● Data Mining

Information Processing

It deals with querying, statistical analysis, and reporting via tables, charts, or graphs. Nowadays, information processing of a data warehouse is done by constructing low-cost, web-based accessing tools, typically integrated with web browsers.

Analytical Processing

It supports various online analytical processing such as drill-down, roll-up, and pivoting. The
historical data is being processed in both summarized and detailed format.

OLAP is implemented on data warehouses or data marts. The primary objective of OLAP is to support the ad-hoc querying needed to support DSS. The multidimensional view of data is fundamental to the OLAP application. OLAP is an operational view, not a data structure or schema. The complex nature of OLAP applications requires a multidimensional view of the data.

Data Mining

It helps in the analysis of hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Data mining is the technique of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

It is the phase of selection, exploration, and modeling of huge quantities of information to determine regularities or relations that are at first unknown, in order to obtain precise and useful results for the owner of the database.

It is the process of inspection and analysis, by automatic or semi-automatic means, of large quantities of records to discover meaningful patterns and rules.
Difference between Star and Snowflake Schemas

Star Schema

o In a star schema, the fact table will be at the center and is connected to the dimension
tables.

o The dimension tables are completely denormalized in structure.

o SQL query performance is good, as fewer joins are involved.

o Data redundancy is high and occupies more disk space.

[Figure: Star schema with a central fact table connected to the surrounding dimension tables]

Snowflake Schema

o A snowflake schema is an extension of the star schema where the dimension tables are
connected to one or more further dimension tables.

o The dimension tables are partially or fully normalized in structure.

o The performance of SQL queries is a bit lower when compared to the star schema, as more
joins are involved.

o Data redundancy is low and occupies less disk space when compared to the star schema.

Let us see the difference between the Star and Snowflake Schema.

[Figure: Side-by-side comparison of a star schema, with the fact/measure table joined directly to its dimensions, and a snowflake schema, with dimensions further normalized into sub-dimensions]
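To make the join difference concrete, here is a small, hypothetical Python/pandas sketch (all table and column names are invented, not taken from the text): the same "sales by brand" question needs one join against a star-style denormalized Product dimension, but two joins against a snowflake-style normalized one.

# Star vs. snowflake: same question, different join depth (illustrative only).
import pandas as pd

sales = pd.DataFrame({"product_key": [1, 1, 2], "sales_amount": [100, 150, 200]})

# Star: the brand attribute is stored redundantly inside one product dimension.
product_star = pd.DataFrame({"product_key": [1, 2],
                             "product_name": ["Cola", "Chips"],
                             "brand_name": ["FreshCo", "Crunchy"]})
star_result = (sales.merge(product_star, on="product_key")
                    .groupby("brand_name")["sales_amount"].sum())

# Snowflake: brand lives in its own lookup table, so an extra join is needed.
product_snow = pd.DataFrame({"product_key": [1, 2],
                             "product_name": ["Cola", "Chips"],
                             "brand_key": [10, 20]})
brand_dim = pd.DataFrame({"brand_key": [10, 20],
                          "brand_name": ["FreshCo", "Crunchy"]})
snow_result = (sales.merge(product_snow, on="product_key")
                    .merge(brand_dim, on="brand_key")
                    .groupby("brand_name")["sales_amount"].sum())

assert star_result.equals(snow_result)  # same answer, one extra join in the snowflake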


What is Snowflake Schema?
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or
more dimension tables do not connect directly to the fact table but must join through other
dimension tables."

The snowflake schema is an expansion of the star schema where each point of the star explodes
into more points. It is called snowflake schema because the diagram of snowflake schema
resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a star
schema. When we normalize all the dimension tables entirely, the resultant structure resembles a
snowflake with the fact table in the middle.

Snowflaking is used to improve the performance of specific queries. The schema is diagramed with
each fact surrounded by its associated dimensions, and those dimensions are related to other
dimensions, branching out into a snowflake pattern.

The snowflake schema consists of one fact table which is linked to many dimension tables, which
can be linked to other dimension tables through a many-to-one relationship. Tables in a snowflake
schema are generally normalized to the third normal form. Each dimension table represents exactly
one level in a hierarchy.

The following diagram shows a snowflake schema with two dimensions, each having three levels.
A snowflake schema can have any number of dimensions, and each dimension can have any
number of levels.

[Figure: Snowflake schema with a central fact table and two dimensions, each normalized into three levels (Level 1, Level 2, Level 3)]
Example: Figure shows a snowflake schema with a Sales fact table, with Store, Location, Time,
Product, Line, and Family dimension tables. The Market dimension has two dimension tables with
Store as the primary dimension table, and Location as the outrigger dimension table. The product
dimension has three dimension tables, with Product as the primary dimension table and the Line
and Family tables as the outrigger dimension tables.

[Figure: Snowflake schema for the Sales fact table]

Sales fact: Store ID, Product ID, Time ID, Sale, Cost of goods sold, Advertising
Store: Store ID, Store name, Store size, Store address, Postal code ID
Location (outrigger): Postal code ID, Postal code, Region name, Region director, State name, State director, State population, City name, City population
Product: Product ID, Product name, Product description, Product ounces, Product caffeinated, Line ID
Line (outrigger): Line ID, Line name, Line description, Family ID
Family (outrigger): Family ID, Family name, Family description
Time: Time ID, Year, Quarter number, Quarter name, Month number, Month name, Week of year, Day name, Weekday, Holiday, Day of year, Day of month, Fiscal year, Fiscal quarter number, Fiscal month number
A star schema stores all attributes for a dimension in one denormalized table. This needs more
disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension by
moving attributes with low cardinality into separate dimension tables that relate to the core
dimension table by using foreign keys. Snowflaking for the sole purpose of minimizing disk space
is not recommended, because it can adversely impact query performance.
In a snowflake schema, tables are normalized to remove redundancy; the dimension tables
are broken down into multiple dimension tables.
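A minimal sketch of snowflaking, assuming a denormalized Product dimension in which the Line and Family attributes are repeated on every row; the table and column names loosely follow the figure above, and the data values are invented:

# Snowflaking a dimension: move the low-cardinality Line and Family attributes
# out of a denormalized Product dimension into outrigger tables.
import pandas as pd

product_denorm = pd.DataFrame({
    "product_id":   [1, 2, 3],
    "product_name": ["Cola", "Diet Cola", "Lemonade"],
    "line_id":      [10, 10, 11],
    "line_name":    ["Colas", "Colas", "Citrus"],
    "family_id":    [100, 100, 100],
    "family_name":  ["Soft drinks", "Soft drinks", "Soft drinks"],
})

# Outrigger tables hold each low-cardinality attribute set exactly once.
line_dim = product_denorm[["line_id", "line_name", "family_id"]].drop_duplicates()
family_dim = product_denorm[["family_id", "family_name"]].drop_duplicates()

# The core dimension keeps only its own attributes plus a foreign key to Line.
product_dim = product_denorm[["product_id", "product_name", "line_id"]]

print(product_dim)
print(line_dim)
print(family_dim)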

The figure shows a simple STAR schema for sales in a manufacturing company. The sales fact table
includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME
are the dimension tables.

[Figure: STAR schema for sales]

SALES FACTS: Product Key, Time Key, Customer Key, Salesrep Key, Sales price, Margin
PRODUCT: Product Key, Product name, Product code, Brand name
CUSTOMER: Customer Key, Customer name
TIME: Time Key, Date, Month, Quarter, Year
SALESREP: Salesrep Key, Salesperson name, Territory name, Region name

The STAR schema for sales, as shown above, contains only five tables, whereas the normalized
version now extends to eleven tables. We will notice that in the snowflake schema, the attributes
with low cardinality in each original dimension table are removed to form separate tables. These
new tables are connected back to the original dimension table through artificial keys.
[Figure: Snowflake schema for sales]

SALES FACTS: Product Key, Time Key, Customer Key, Salesrep Key, Sales quantity, Sales dollars, Sales price, Margin
PRODUCT: Product Key, Product name, Product code, Brand Key, Category Key, Package Key
BRAND (outrigger): Brand Key, Brand name
CATEGORY (outrigger): Category Key, Product Category
PACKAGE (outrigger): Package Key, Package type
CUSTOMER: Customer Key, Customer name, Customer code, Address, State, Zip, Country Key
COUNTRY (outrigger): Country Key, Country name
TIME: Time Key, Date, Month, Quarter, Year
SALESREP: Salesrep Key, Salesperson name, Territory Key
TERRITORY (outrigger): Territory Key, Territory name, Region Key
REGION (outrigger): Region Key, Region name

A snowflake schema is designed for flexible querying across more complex dimensions and
relationships. It is suitable for many-to-many and one-to-many relationships between dimension
levels.

Advantage of Snowflake Schema

1. The primary advantage of the snowflake schema is the improvement in query performance
due to minimized disk storage requirements and joining smaller lookup tables.

2. It provides greater scalability in the interrelationship between dimension levels and
components.

3. There is no redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional maintenance effort
required due to the increasing number of lookup tables.

2. The queries are more complex and hence more difficult to understand.


What is Star Schema?
A star schema is the elementary form of a dimensional model, in which data are organized into
facts and dimensions. A fact is an event that is counted or measured, such as a sale or log in. A
dimension includes reference data about the fact, such as date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional data model. The
star schema is the simplest data warehouse schema. It is known as a star schema because the
entity-relationship diagram of this schema resembles a star, with points diverging from a central
table. The center of the schema consists of a large fact table, and the points of the star are the
dimension tables.

[Figure: Star schema with a central fact table surrounded by its dimension tables]

Fact Tables

A fact table is a table in a star schema that contains facts and is connected to dimensions. A fact table has two
types of columns: those that contain facts and those that are foreign keys to the dimension tables.
The primary key of the fact table is generally a composite key that is made up of all of its foreign
keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often instead called summary tables). A fact table generally
contains facts with the same level of aggregation.

Dimension Tables

A dimension is a structure usually composed of one or more hierarchies that categorize data.
If a dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary
keys of each of the dimension tables are part of the composite primary key of the fact table.
Dimensional attributes help to define the dimensional value. They are generally descriptive, textual
values. Dimension tables are usually smaller in size than fact tables.

Fact tables store data about sales, while dimension tables store data about the geographic region
(markets, cities), clients, products, times, and channels.

Characteristics of Star Schema

The star schema is highly suitable for data warehouse database design because of the following
features:

o It creates a de-normalized database that can quickly provide query responses.

o It provides a flexible design that can be changed easily or added to throughout the
development cycle, and as the database grows.

o It parallels in design the way end-users typically think of and use the data.

o It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema

Star schemas are easy for end-users and applications to understand and navigate. With a well-
designed schema, the customer can instantly analyze large, multidimensional data sets.

The main advantages of star schemas in a decision-support environment are:

o Query performance
o Load performance and administration
o Built-in referential integrity
o Easily understood

Query Performance

Because a star schema database has a limited number of tables and clear join paths, queries run faster
than they do against OLTP systems. Small single-table queries, frequently of a dimension table, are
almost instantaneous. Large join queries that contain multiple tables take only seconds or
minutes to run.

In a star schema database design, the dimensions are connected only through the central fact table.
When two dimension tables are used in a query, only one join path, intersecting the fact table,
exists between those two tables. This design feature enforces accurate and consistent query
results.

Load performance and administration

Structural simplicity also decreases the time required to load large batches of records into a star
schema database. By defining facts and dimensions and separating them into different tables,
the impact of a load operation is reduced. Dimension tables can be populated once and
occasionally refreshed. We can add new facts regularly and selectively by appending records to a
fact table.
Built-in referential integrity

A star schema has referential integrity built in when information is loaded. Referential integrity is
enforced because each record in the dimension tables has a unique primary key, and all keys in the
fact table are legitimate foreign keys drawn from the dimension tables. A record in the fact table
that is not related correctly to a dimension cannot be given the correct key value to be retrieved.
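The idea can be pictured as a load-time check, shown below as a minimal sketch with hypothetical dimension and fact data (not from the text): only fact rows whose foreign keys resolve to existing dimension rows are loaded.

# Referential integrity at load time: reject fact rows whose foreign keys do
# not exist in the dimension tables. All names and values are illustrative.
import pandas as pd

time_dim = pd.DataFrame({"time_key": [1, 2]})
product_dim = pd.DataFrame({"product_key": [10, 11]})

incoming_facts = pd.DataFrame({
    "time_key":    [1, 2, 3],          # time_key 3 has no dimension record
    "product_key": [10, 11, 10],
    "sales_price": [100, 150, 120],
})

valid = (incoming_facts["time_key"].isin(time_dim["time_key"])
         & incoming_facts["product_key"].isin(product_dim["product_key"]))
fact_table = incoming_facts[valid]      # loaded into the warehouse
rejected = incoming_facts[~valid]       # returned for correction
print(fact_table)
print(rejected)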

Easily Understood

A star schema is simple to understand and navigate, with dimensions joined only through the fact
table. These joins are meaningful to the end-user because they represent the fundamental
relationships between parts of the underlying business. Customers can also browse dimension table
attributes before constructing a query.

Disadvantage of Star Schema

There are some conditions that cannot be met by star schemas; for example, the relationship between a
user and a bank account cannot be described as a star schema because the relationship between them is many
to many.

Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables
connected to it for time, branch, item, and geographic locations.

The TIME table has columns for day, month, quarter, and year. The ITEM table has columns
for item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns for
branch_key, branch_name, and branch_type. The LOCATION table has columns of geographic data,
including street, city, state, and country.
[Figure: The SALES fact table at the center, surrounded by the four dimension tables TIME, ITEM, BRANCH, and LOCATION]

In this scenario, the SALES table contains only four columns with IDs from the dimension tables,
TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four columns for
ITEM data, three columns for BRANCH data, and four columns for LOCATION data. Thus, the size
of the fact table is significantly reduced. When we need to change an item, we need only make a
single change in the dimension table, instead of making many changes in the fact table.

We can create even more complex star schemas by normalizing a dimension table into several
tables. The normalized dimension table is called a Snowflake.
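The SALES example above can be sketched in Python/pandas with invented values; note how a query joins back to the dimension tables, and how renaming an item touches only one row of the ITEM dimension while the fact table is untouched.

# Star schema sketch for the SALES example (data values are invented).
import pandas as pd

item = pd.DataFrame({"item_key": [1, 2],
                     "item_name": ["Laptop", "Phone"],
                     "brand": ["Acme", "Acme"]})
time = pd.DataFrame({"time_key": [100], "quarter": ["Q1"], "year": [2024]})
branch = pd.DataFrame({"branch_key": [7], "branch_name": ["Chennai-1"]})
location = pd.DataFrame({"location_key": [50], "city": ["Chennai"]})

# The fact table holds only dimension keys plus measures.
sales = pd.DataFrame({"time_key": [100, 100], "item_key": [1, 2],
                      "branch_key": [7, 7], "location_key": [50, 50],
                      "rupees_sold": [75000, 20000], "units_sold": [1, 1]})

# Query: sales by item name requires a join back to the ITEM dimension.
print(sales.merge(item, on="item_key").groupby("item_name")["rupees_sold"].sum())

# A single change in the dimension table is reflected in every query result.
item.loc[item["item_key"] == 2, "item_name"] = "Smartphone"
print(sales.merge(item, on="item_key").groupby("item_name")["rupees_sold"].sum())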
What is Data Cube?
When data is grouped or combined into multidimensional matrices, the result is called a Data Cube. The data
cube method has a few alternative names or variants, such as "multidimensional databases,"
"materialized views," and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive computations that are
frequently inquired.

For example, a relation with the schema sales (part, supplier, customer, and sale-price) can be
materialized into a set of eight views as shown in fig, where psc indicates a view consisting of
aggregate function value (such as total-sales) computed by grouping three attributes part,
supplier, and customer, p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone, etc.

[Figure: The eight views of data cubes for sales information, from psc (grouped by part, supplier, and customer) at the top down to none (no grouping) at the bottom]

A data cube is created from a subset of attributes in the database. Specific attributes are chosen to
be measure attributes, i.e., the attributes whose values are of interest. Other attributes are
selected as dimensions or functional attributes. The measure attributes are aggregated according
to the dimensions.
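Continuing the sales(part, supplier, customer, sale-price) example, the following sketch (with made-up rows) materializes the eight group-by views, from psc down to none:

# Enumerate the 2^3 = 8 aggregate views of sales(part, supplier, customer).
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "part": ["p1", "p1", "p2"], "supplier": ["s1", "s2", "s1"],
    "customer": ["c1", "c1", "c2"], "sale_price": [10, 20, 30],
})

dimensions = ["part", "supplier", "customer"]
views = {}
for r in range(len(dimensions), -1, -1):
    for group in combinations(dimensions, r):
        name = "".join(d[0] for d in group) or "none"   # psc, ps, ..., p, ..., none
        if group:
            views[name] = sales.groupby(list(group))["sale_price"].sum()
        else:
            views[name] = sales["sale_price"].sum()     # grand total over everything

print(list(views))   # ['psc', 'ps', 'pc', 'sc', 'p', 's', 'c', 'none']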

For example, XYZ may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, branch, and location. These dimensions enable the store to keep track of
things like monthly sales of items, and the branches and locations at which the items were sold.
Each dimension may have a table identified with it, known as a dimensional table, which describes
the dimension. For example, a dimension table for items may contain the attributes item_name,
brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse
in many cases because not every cell in each dimension may have corresponding data in the
database.

Techniques should be developed to handle sparse cubes efficiently.

If a query contains constants at even lower levels than those provided in a data cube, it is not clear
how to make the best use of the precomputed results stored in the data cube.

The multidimensional data model views data in the form of a data cube. OLAP tools are based on the multidimensional
data model. Data cubes usually model n-dimensional data.

A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional
data model is organized around a central theme, like sales and transactions. A fact table
represents this theme. Facts are numerical measures. Thus, the fact table contains measure (such
as Rs_sold) and keys to each of the related dimensional tables.

Dimensions are the perspectives or entities that define a data cube. Facts are generally numeric quantities, which are used for
analyzing the relationships between dimensions.

[Figure: A data cube with dimensions Date (1Qtr to 4Qtr), product, and Country, with 'sum' totals aggregated along each dimension]

Example: In the 2-D representation, we will look at the All Electronics sales data for items sold
per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).
2-D view of Sales Data (location = "Vancouver"); columns are item types, measure is dollars sold in thousands

time (quarter)    home entertainment    computer    phone    security
Q1                605                   825         14       400
Q2                680                   952         31       512
Q3                812                   1023        30       501
Q4                927                   1038        38       580

3-Dimensional Cuboids

Let us suppose we would like to view the sales data with a third dimension. For example, suppose we
would like to view the data according to time and item, as well as the location, for the cities Chicago,
New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D
data are shown in the table. The 3-D data of the table are represented as a series of 2-D tables.

3-D view of Sales Data

[Table: 3-D view of the sales data (in thousands of dollars), shown as a series of 2-D sub-tables for the cities Chicago, New York, and Toronto, each broken down by time (quarter Q1 to Q4) and item type (home entertainment, computer, phone, security); the Vancouver sub-table is the 2-D view shown above]

Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
[Figure: 3-D data cube of the sales data, with dimensions item (types), time (quarter), and location (Chicago, New York, Toronto, Vancouver); the measure is dollars sold in thousands]

Let us suppose that we would like to view our sales data with an additional fourth dimension, such
as a supplier.

In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level of
summarization is called a base cuboid.

For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location,
and supplier dimensions.
[Figure: 4-D data cube of the sales data, drawn as a series of 3-D cubes, one for each supplier (SUP1, SUP2, SUP3); the dimensions are time, item, location, and supplier]

The figure shows a 4-D data cube representation of sales data, according to the dimensions time,
item, location, and supplier. The measure displayed is dollars sold (in thousands).

The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex
cuboid. In this example, this is the total sales, or dollars sold, summarized over all four dimensions.

The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating a 4-D data
cube for the dimensions time, item, location, and supplier. Each cuboid represents a different
degree of summarization.
[Figure: Lattice of cuboids forming the 4-D data cube for the dimensions time, item, location, and supplier]

0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: pairs of dimensions, such as (time, item), (time, supplier), (item, location)
3-D cuboids: triples of dimensions, such as (time, item, location), (time, item, supplier)
4-D (base) cuboid: (time, item, location, supplier)


What is Dimensional Modeling?
Dimensional modeling represents data with a cube operation, making the logical data
representation more suitable for OLAP data management. The concept of Dimensional Modeling was
developed by Ralph Kimball and consists of "fact" and "dimension" tables.

In dimensional modeling, the transaction record is divided into either "facts," which are
frequently numerical transaction data, or "dimensions," which are the reference information that
gives context to the facts. For example, a sale transaction can be broken down into facts such as
the number of products ordered and the price paid for the products, and into dimensions such as
order date, user name, product number, order ship-to and bill-to locations, and the salesperson
responsible for receiving the order.

Objectives of Dimensional Modeling

The purposes of dimensional modeling are:

1. To produce a database architecture that is easy for end-clients to understand and to write
queries against.

2. To maximize the efficiency of queries. It achieves these goals by minimizing the number of
tables and relationships between them.

Advantages of Dimensional Modeling

Following are the benefits of dimensional modeling are:

Dimensional modeling is simple: Dimensional modeling methods make it possible for
warehouse designers to create database schemas that business customers can easily grasp and
comprehend. There is no need for vast training on how to read diagrams, and there are no
complicated relationships between different data elements.

Dimensional modeling promotes data quality: The star schema enables warehouse
administrators to enforce referential integrity checks on the data warehouse. Since the fact
table key is a concatenation of the keys of its associated dimensions, a fact record is actively
loaded only if the corresponding dimension records are duly defined and also exist in the
database.
By enforcing foreign key constraints as a form of referential integrity check, data warehouse DBAs
add a line of defense against corrupted warehouse data.

Performance optimization is possible through aggregates: As the size of the data warehouse
increases, performance optimization develops into a pressing concern. Customers who have to
wait for hours to get a response to a query will quickly become discouraged with the warehouse.
Aggregates are one of the easiest methods by which query performance can be optimized.
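A rough sketch of the aggregate idea (invented data and names): a summary table is computed once, so repeated queries read a handful of summary rows instead of scanning every fact row.

# Precomputed aggregate: build a small monthly summary once, query it many times.
import pandas as pd

fact = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100, 250, 120, 80, 300],
})

# Built once, for example during the nightly load window.
monthly_aggregate = fact.groupby(["month", "product"], as_index=False)["revenue"].sum()

# An ad hoc query is answered from the aggregate instead of the detail facts.
print(monthly_aggregate[monthly_aggregate["month"] == "Feb"])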

Disadvantages of Dimensional Modeling

1. To maintain the integrity of facts and dimensions, loading the data warehouse with records
from various operational systems is complicated.

2. It is difficult to modify the data warehouse operation if the organization adopting the
dimensional technique changes the method in which it does business.

Elements of Dimensional Modeling

Fact

It is a collection of associated data items, consisting of measures and context data. It typically
represents business items or business transactions.

Dimensions

It is a collection of data which describe one business dimension. Dimensions decide the contextual
background for the facts, and they are the framework over which OLAP is performed.

Measure

It is a numeric attribute of a fact, representing the performance or behavior of the business


relative to the dimensions.

Considering the relational context, there are two basic models which are used in dimensional
modeling:

o Star Model

o Snowflake Model
The star model is the underlying structure for a dimensional model. It has one broad central table
(fact table) and a set of smaller tables (dimensions) arranged in a radial design around the primary
table. The snowflake model is the result of decomposing one or more of the dimensions.

Fact Table

Fact tables are used to record the facts or measures of the business. Facts are the numeric data elements
that are of interest to the company.

Characteristics of the Fact table

The fact table includes numerical values of what we measure. For example, a fact value of 20 might
mean that 20 widgets have been sold.

Each fact table includes the keys to associated dimension tables. These are known as foreign keys
in the fact table.

Fact tables typically include a small number of columns.

When it is compared to dimension tables, fact tables have a large number of rows.

Dimension Table

Dimension tables establish the context of the facts. Dimensional tables store fields that describe
the facts.

Characteristics of the Dimension table

Dimension tables contain the details about the facts. That, as an example, enables the business
analysts to understand the data and their reports better.

The dimension tables include descriptive data about the numerical values in the fact table. That is,
they contain the attributes of the facts. For example, the dimension tables for a marketing analysis
function might include attributes such as time, marketing region, and product type.

Since the record in a dimension table is denormalized, it usually has a large number of columns.
The dimension tables include significantly fewer rows of information than the fact table.

The attributes in a dimension table are used as row and column headings in a document or query
results display.
Example: A store summary in a fact table can be viewed by city and state, an item summary can be
viewed by brand, color, etc., and customer information can be viewed by name and address.

Sales (StoreID, ItemID, CustID, qty, price)

StoreID (storeid, city, state)
ItemID (itemid, category, brand, color, size)
CustID (custid, name, address)
[Figure: The Sales fact table (StoreID, ItemID, CustID, qty, price) at the center, connected to the StoreID, ItemID, and CustID dimension tables]

Fact Table

Time ID Product ID Customer ID Unit Sold

4 17 2 1

8 21 3 2

8 4 1 1

In this example, the Customer ID column in the fact table is the foreign key that joins with the
dimension table. By following the links, we can see that row 2 of the fact table records the fact
that customer 3, Gaurav, bought two items on day 8.
Dimension Tables

Customer ID Name Gender Income Education Region

1 Rohan Male 2 3 4

2 Sandeep Male 3 5 1

3 Gaurav Male 1 7 3

Hierarchy

A hierarchy is a directed tree whose nodes are dimensional attributes and whose arcs model many-
to-one associations between dimensional attributes. It contains a dimension, positioned at the
tree's root, and all of the dimensional attributes that define it.
What is Multi-Dimensional Data Model?
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities concerning which an organization keeps records.
For example, a shop may create a sales data warehouse to keep records of the store's sales for the
dimension time, item, and location. These dimensions allow the store to keep track of things, for
example, monthly sales of items and the locations at which the items were sold. Each dimension
has a table related to it, called a dimensional table, which describes the dimension further. For
example, a dimensional table for an item may contain the attributes item_name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.

[Figure: The same sales data shown both in a tabular (relational) representation, with columns for item, time, location, and sales, and in a multidimensional representation; the slice for locid = 1 is highlighted]

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the
table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in
quarters) and the item dimension (classified according to the types of items sold). The fact or
measure displayed is rupee_sold (in thousands).
Location = "Delhi"; columns are item types, measure is rupee_sold in thousands

Time (quarter)    Egg    Milk    Bread    Biscuit
Q1                260    508     15       60
Q2                390    256     20       90
Q3                436    396     50       40
Q4                528    483     35       50

Now, if we want to view the sales data with a third dimension, for example, the data according to
time and item as well as location, the location is considered for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3-D data are shown in the table. The 3-D data of the table are
represented as a series of 2-D tables.

Location = "Chennai"
Time    Egg    Milk    Bread    Biscuit
Q1      340    360     20       10
Q2      490    450     16       50
Q3      680    583     45       43
Q4      535    694     39       38

Location = "Kolkata"
Time    Egg    Milk    Bread    Biscuit
Q1      435    460     20       15
Q2      389    385     45       35
Q3      684    490     39       48
Q4      335    365     83       35

Location = "Mumbai"
Time    Egg    Milk    Bread    Biscuit
Q1      390    385     20       39
Q2      463    366     25       48
Q3      568    594     36       39
Q4      338    484     48       80

Location = "Delhi"
Time    Egg    Milk    Bread    Biscuit
Q1      260    508     15       60
Q2      390    256     20       90
Q3      436    396     50       40
Q4      528    483     35       50

Conceptually, the same data may also be represented in the form of a 3-D data cube, as
shown in the figure:
[Figure: 3-D data cube of the sales data, with dimensions time (quarter), item (Egg, Milk, Bread, Biscuit), and location (Chennai, Kolkata, Mumbai, Delhi); the measure is rupee_sold in thousands]
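The multidimensional view can be sketched with the figures from the tables above: pivoting the rows produces a quarter-by-item matrix, and fixing the location gives a 2-D slice of the 3-D cube. Only the Egg and Milk columns for Delhi and Mumbai are reproduced in this sketch.

# Pivoting and slicing the sales data (a subset of the figures shown above).
import pandas as pd

sales = pd.DataFrame({
    "location": ["Delhi"] * 8 + ["Mumbai"] * 8,
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q4", "Q4"] * 2,
    "item":     ["Egg", "Milk"] * 8,
    "rupees_sold": [260, 508, 390, 256, 436, 396, 528, 483,
                    390, 385, 463, 366, 568, 594, 338, 484],
})

# Slice: fix location = "Delhi" and view the time x item face of the cube.
delhi_slice = sales[sales["location"] == "Delhi"].pivot_table(
    index="quarter", columns="item", values="rupees_sold", aggfunc="sum")
print(delhi_slice)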
UNIT V

SYSTEM & PROCESS MANAGERS

Data Warehousing System Managers: System Configuration Manager- System


Scheduling Manager - System Event Manager - System Database Manager -
System Backup Recovery Manager - Data Warehousing Process Managers:
Load Manager – Warehouse Manager- Query Manager – Tuning – Testing
Data Warehousing - System Managers

System management is mandatory for the successful implementation of a data


warehouse. The most important system managers are −

➢ System configuration manager


➢ System scheduling manager
➢ System event manager
➢ System database manager
➢ System backup recovery manager

System Configuration Manager :


● The system configuration manager is responsible for the management of the
setup and configuration of data warehouse.
● The structure of the configuration manager varies from one operating system to
another.
● In Unix, the structure of the configuration manager varies from vendor to vendor.
● Configuration managers have a single user interface.
● The interface of configuration manager allows us to control all aspects of the
system.

Note − The most important configuration tool is the I/O manager.


System Scheduling Manager :
● The System Scheduling Manager is responsible for the successful
implementation of the data warehouse.
● Its purpose is to schedule ad hoc queries. Every operating system has its
own scheduler with some form of batch control mechanism.

The list of features a system scheduling manager must have is as follows −

❖ Work across cluster or MPP boundaries


❖ Deal with international time differences
❖ Handle job failure
❖ Handle multiple queries
❖ Support job priorities
❖ Restart or re-queue the failed jobs
❖ Notify the user or a process when a job is completed
❖ Maintain the job schedules across system outages
❖ Re-queue jobs to other queues
❖ Support the stopping and starting of queues
❖ Log Queued jobs
❖ Deal with inter-queue processing

Note − The above list can be used as evaluation parameters for the
evaluation of a good scheduler.

Some important jobs that a scheduler must be able to handle are as follows −

➢ Daily and ad hoc query scheduling


➢ Execution of regular report requirements
➢ Data load
➢ Data processing
➢ Index creation
➢ Backup
➢ Aggregation creation
➢ Data transformation
Note − If the data warehouse is running on a cluster or MPP architecture, then the
system scheduling manager must be capable of running across the architecture.
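The priority and re-queue-on-failure behaviour from the feature list above can be pictured with a small, hypothetical sketch; it is not modeled on any particular scheduler product, and the job names are invented.

# Toy scheduler: lower priority number runs first; a failed job is re-queued.
import heapq

queue = [(2, "daily report"), (1, "data load"), (3, "index creation")]
heapq.heapify(queue)

failed_once = set()

def run(job):
    # Pretend the data load fails on its first attempt only.
    return not (job == "data load" and job not in failed_once)

while queue:
    priority, job = heapq.heappop(queue)
    if run(job):
        print(f"completed: {job} (priority {priority})")
    else:
        failed_once.add(job)
        heapq.heappush(queue, (priority, job))   # re-queue the failed job
        print(f"re-queued: {job}")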

System Event Manager:


● The event manager is a kind of software.
● The event manager manages the events that are defined on the data
warehouse system. We cannot manage the data warehouse manually
because the structure of the data warehouse is very complex.
● Therefore we need a tool that automatically handles all the events without
any intervention of the user.

Note − The Event manager monitors the events occurrences and deals with
them. The event manager also tracks the myriad of things that can go
wrong on this complex data warehouse system.

Events :

Events are the actions that are generated by the user or the system itself.
It may be noted that the event is a measurable, observable, occurrence of a defined
action.

Given below is a list of common events that are required to be tracked.

➔ Hardware failure
➔ Running out of space on certain key disks
➔ A process dying
➔ A process returning an error
➔ CPU usage exceeding an 80% threshold
➔ Internal contention on database serialization points
➔ Buffer cache hit ratios exceeding or falling below a threshold
➔ A table reaching the maximum of its size
➔ Excessive memory swapping
➔ A table failing to extend due to lack of space
➔ Disk exhibiting I/O bottlenecks
➔ Usage of temporary or sort areas reaching certain thresholds
➔ Any other database shared memory usage
➔ The most important thing about events is that they should be capable of
executing on their own.
➔ Event packages define the procedures for the predefined events. The code
associated with each event is known as an event handler.
➔ This code is executed whenever an event occurs, as sketched below.
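A minimal, hypothetical sketch of the event-handler idea follows; the event names, details, and actions are invented for illustration and do not describe any particular product.

# Each predefined event name is bound to a handler that runs automatically.
handlers = {}

def on(event_name):
    # Register the decorated function as the handler for event_name.
    def register(func):
        handlers[event_name] = func
        return func
    return register

@on("disk_space_low")
def handle_disk_space_low(details):
    print(f"Alerting operators: key disk nearly full ({details})")

@on("process_died")
def handle_process_died(details):
    print(f"Restarting process: {details}")

def raise_event(event_name, details):
    handler = handlers.get(event_name)
    if handler:
        handler(details)                    # executed without user intervention
    else:
        print(f"Unhandled event logged: {event_name}")

raise_event("disk_space_low", "/dw/stage at 92%")
raise_event("process_died", "load_manager pid 4711")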

System and Database Manager :


● System and database manager may be two separate pieces of software, but
they do the same job.
● The objective of these tools is to automate certain processes and to simplify
the execution of others.

The criteria for choosing a system and the database manager are as follows −

➔ increase user's quota.


➔ assign and de-assign roles and Profile to the users
➔ perform database space management
➔ monitor and report on space usage
➔ tidy up fragmented and unused space
➔ add and expand the space
➔ add and remove users, manage user password
➔ manage summary or temporary tables
➔ assign or deassign temporary space to and from the user
➔ reclaim the space from old or out-of-date temporary tables
➔ manage error and trace logs, to browse log and trace files
➔ redirect error or trace information
➔ switch on and off error and trace logging
➔ perform system space management
➔ monitor and report on space usage
➔ clean up old and unused file directories
➔ add or expand space.

System Backup Recovery Manager:

● The backup and recovery tool makes it easy for operations and management
staff to back-up the data.
● Note that the system backup manager must be integrated with the schedule
manager software being used.

The important features that are required for the management of backups are as
follows −

➔ Scheduling
➔ Backup data tracking
➔ Database awareness

Backups are taken only to protect against data loss. Following are the important
points to remember −

★ The backup software will keep some form of database of where and when the
piece of data was backed up.
★ The backup recovery manager must have a good front-end to that database.
★ The backup recovery software should be database aware.
★ Being aware of the database, the software then can be addressed in
database terms, and will not perform backups that would not be viable.
Data Warehousing - Process Managers

Process managers are responsible for maintaining the flow of data both into and out
of the data warehouse. There are three different types of process managers −

➢ Load manager
➢ Warehouse manager
➢ Query manager

Data Warehouse Load Manager

● Load manager performs the operations required to extract and load the data
into the database.
● The size and complexity of a load manager varies between specific solutions
from one data warehouse to another.

Load Manager Architecture

The load manager performs the following functions −

● Extract data from the source system.


● Fast load the extracted data into temporary data store.
● Perform simple transformations into structure similar to the one in the data
warehouse.
● Extract Data from Source
❖ The data is extracted from the operational databases or the external
information providers. Gateways are the application programs that are used
to extract data.
❖ It is supported by underlying DBMS and allows the client program to
generate SQL to be executed at a server.
❖ Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC)
are examples of gateways.

Fast Load

➢ In order to minimize the total load window, the data needs to be loaded into
the warehouse in the fastest possible time.
➢ Transformations affect the speed of data processing.
➢ It is more effective to load the data into a relational database prior to
applying transformations and checks.
➢ Gateway technology is not suitable, since it is inefficient when large data
volumes are involved.
Simple Transformations

● While loading, it may be required to perform simple transformations.


● After completing simple transformations, we can do complex checks.
● Suppose we are loading the EPOS sales transactions; we need to perform the
following checks, illustrated in the sketch below −

1. Strip out all the columns that are not required within the warehouse.

2. Convert all the values to the required data types.
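A rough sketch of these two simple transformations, assuming a hypothetical EPOS extract whose column names and values are invented:

# Simple load-time transformations: drop unneeded columns, convert data types.
import pandas as pd

extract = pd.DataFrame({
    "txn_id":      ["1001", "1002"],
    "sale_date":   ["2024-01-05", "2024-01-06"],
    "amount":      ["199.99", "49.50"],
    "cashier_pin": ["9911", "2044"],     # not required within the warehouse
})

# 1. Strip out all the columns that are not required within the warehouse.
staged = extract.drop(columns=["cashier_pin"])

# 2. Convert all the values to the required data types.
staged = staged.astype({"txn_id": "int64", "amount": "float64"})
staged["sale_date"] = pd.to_datetime(staged["sale_date"])

print(staged.dtypes)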

Warehouse Manager:
● The warehouse manager is responsible for the warehouse management
process.
● It consists of a third-party system software, C programs, and shell scripts.
● The size and complexity of a warehouse manager varies between specific
solutions.

Warehouse Manager Architecture

A warehouse manager includes the following −

★ The controlling process


★ Stored procedures or C with SQL
★ Backup/Recovery tool
★ SQL scripts
Functions of Warehouse Manager:

A warehouse manager performs the following functions −

❖ Analyzes the data to perform consistency and referential integrity checks.


❖ Creates indexes, business views, partition views against the base data.
❖ Generates new aggregations and updates the existing aggregations.
❖ Generates normalizations.
❖ Transforms and merges the source data of the temporary store into the
published data warehouse.
❖ Backs up the data in the data warehouse.
❖ Archives the data that has reached the end of its captured life.

Note − A warehouse Manager analyzes query profiles to determine whether the


index and aggregations are appropriate.
Query Manager:
➢ The query manager is responsible for directing the queries to suitable tables.
➢ By directing the queries to appropriate tables, it speeds up the query
request and response process.
➢ In addition, the query manager is responsible for scheduling the execution of
the queries posted by the user.

Query Manager Architecture

A query manager includes the following components −

➔ Query redirection via C tool or RDBMS


➔ Stored procedures
➔ Query management tool
➔ Query scheduling via C tool or RDBMS
➔ Query scheduling via third-party software
Functions of Query Manager:

➢ It presents the data to the user in a form they understand.


➢ It schedules the execution of the queries posted by the end-user.
➢ It stores query profiles to allow the warehouse manager to determine which
indexes and aggregations are appropriate.
