Data Warehouse Unit-3 Complete

Uploaded by Sandeep Nayal

Data Warehouse

A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived
from transaction data from one or more sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and
focuses on providing support for decision-makers for data modeling and analysis.

A Data Warehouse is a collection of data that serves the entire organization, not
only a particular group of users.

It is not used for daily operations and transaction processing but used for making
decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

○ It is a database designed for investigative tasks, using data from various
applications.
○ It supports a relatively small number of clients with relatively long
interactions.
○ It includes current and historical data to provide a historical perspective
of information.
○ Its usage is read-intensive.
○ It contains a few large tables.

"A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management's decisions."

Characteristics of Data Warehouse


Subject-Oriented
A data warehouse targets the modeling and analysis of data for
decision-makers. Therefore, data warehouses typically provide a concise and
straightforward view around a particular subject, such as customer, product, or
sales, instead of the organization's global ongoing operations. This is done by
excluding data that are not useful concerning the subject and including all data
needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS,
flat files, and online transaction records. It requires performing data cleaning and
integration during data warehousing to ensure consistency in naming
conventions, attributes types, etc., among different data sources.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older from a data
warehouse. This contrasts with a transaction system, where often only the
most current data is kept.

Non-Volatile
The data warehouse is a physically separate data storage, which is transformed
from the source operational RDBMS. The operational updates of data do not
occur in the data warehouse, i.e., update, insert, and delete operations are not
performed. It usually requires only two procedures in data accessing: Initial
loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for
substantial speedup of data retrieval. Non-volatile means that once data has
entered the warehouse, it should not change.

Goals of Data Warehousing


○ To help reporting as well as analysis
○ Maintain the organization's historical information
○ Be the foundation for decision making.

Need for Data Warehouse


Data Warehouse is needed for the following reasons:

1. Business users: Business users require a data warehouse to view
summarized data from the past. Since these users are non-technical, the
data may be presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant
data from the past. This input is used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in
the data warehouse. So, data warehouses contribute to making strategic
decisions.
4. Data consistency and quality: By bringing the data from different sources
into a common place, the user can effectively bring uniformity and
consistency to the data.
5. High response time: A data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant
degree of flexibility and quick response time.

Benefits of Data Warehouse


1. Understand business trends and make better forecasting decisions.
2. Data Warehouses are designed to process enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to
navigate, understand, and query.
4. Queries that would be complex in many normalized databases could be
easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of
information from lots of users.
6. Data warehousing provides the capabilities to analyze a large amount of
historical data.

Data Warehouse Delivery Process

The main steps used in the data warehouse delivery process are as follows:

1. IT Strategy

○ Develop a strategy for securing and retaining funding for the data
warehouse project.
2. Business Case Analysis

○ Understand the level of investment justified.


○ Identify projected business benefits from using the data warehouse.
3. Education & Prototyping

○ Educate stakeholders on the value of the data warehouse.


○ Use prototyping to demonstrate potential benefits and gather
feedback.
4. Business Requirements

○ Define the logical model for data.


○ Identify source systems and mapping rules.
○ Establish business rules and query profiles.
5. Technical Blueprint

○ Create an architecture plan for the data warehouse.


○ Define server and data mart architecture and essential database
components.
6. Building the Vision

○ Produce the first deliverable.


○ Create infrastructure for extracting and loading information from
sources.
7. History Load

○ Load the remaining historical data into the data warehouse.


○ Create additional tables to handle increased data volumes.
8. Ad-Hoc Query

○ Configure ad-hoc query tools for user access.


○ Enable end-users to generate queries for their needs.
9. Automation

○ Automate operational processes such as data extraction, loading, and
transformation.

Difference between Operational Database and Data Warehouse

The Operational Database is the source of information for the data warehouse. It
includes detailed information used to run the day-to-day operations of the
business. The data frequently changes as updates are made and reflects the
current value of the last transactions.

Operational Database Management Systems, also called OLTP (Online
Transaction Processing) databases, are used to manage dynamic data in
real-time.

Data Warehouse Systems serve users or knowledge workers for the purpose of
data analysis and decision-making. Such systems can organize and present
information in specific formats to accommodate the diverse needs of various
users. These systems are called Online Analytical Processing (OLAP) systems.

Data Warehouse and the OLTP database are both relational databases. However,
the goals of both these databases are different.

Operational Database vs. Data Warehouse:

○ Operational systems are designed to support high-volume transaction
processing; data warehousing systems are typically designed to support
high-volume analytical processing (i.e., OLAP).
○ Operational systems are usually concerned with current data; data
warehousing systems are usually concerned with historical data.
○ Data within operational systems are mainly updated regularly according to
need; warehouse data are non-volatile: new data may be added regularly,
but once added they are rarely changed.
○ Operational systems are designed for real-time business dealings and
processes; warehouses are designed for analysis of business measures by
subject area, categories, and attributes.
○ Operational systems are optimized for a simple set of transactions,
generally adding or retrieving a single row at a time per table; warehouses
are optimized for bulk loads and large, complex, unpredictable queries that
access many rows per table.
○ Operational systems are optimized for validation of incoming information
during transactions and use validation data tables; warehouses are loaded
with consistent, valid information and require no real-time validation.
○ Operational systems support thousands of concurrent clients; warehouses
support a few concurrent clients relative to OLTP.
○ Operational systems are widely process-oriented; data warehousing
systems are widely subject-oriented.
○ Operational systems are usually optimized to perform fast inserts and
updates of relatively small volumes of data; warehouses are usually
optimized to perform fast retrievals of relatively large volumes of data.
○ In operational systems data goes in; from warehouses data comes out.
○ Operational queries access a small number of records; warehouse queries
access a large number of records.
○ Relational databases are created for online transaction processing (OLTP);
data warehouses are designed for online analytical processing (OLAP).
Multidimensional Data Model

A multidimensional model views data in the form of a data cube. A data cube
enables data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.

The dimensions are the perspectives or entities with respect to which an
organization keeps records. For example, a shop may create a sales data
warehouse to keep records of the store's sales along the dimensions time, item,
and location. These dimensions allow the store to keep track of things such as
monthly sales of items and the locations at which the items were sold. Each
dimension has a table related to it, called a dimension table, which describes
the dimension further. For example, a dimension table for item may contain
the attributes item_name, brand, and type.

A multidimensional data model is organized around a central theme, for
example, sales. This theme is represented by a fact table. Facts are numerical
measures. The fact table contains the names of the facts or measures and keys
to the related dimension tables.

Consider the data of a shop for items sold per quarter in the city of Delhi. The
data is shown in the table. In this 2D representation, the sales for Delhi are shown
for the time dimension (organized in quarters) and the item dimension (classified
according to the types of items sold). The fact or measure displayed is
rupees_sold (in thousands).

Now suppose we want to view the sales data with a third dimension. For
example, the data according to time and item, as well as location, is considered
for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in
the table; the 3D data are represented as a series of 2D tables.
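As an illustrative sketch (not part of the original example's tables), a small 3D cube can be modeled in plain Python as a mapping from dimension coordinates to a measure, with roll-up as aggregation over dropped dimensions. The rupee figures below are invented:

```python
from collections import defaultdict

# Toy 3D sales cube: (location, quarter, item_type) -> rupees sold (thousands).
# Cities and item types follow the example above; the figures are invented.
sales = {
    ("Delhi",   "Q1", "mobile"): 605, ("Delhi",   "Q1", "modem"): 825,
    ("Delhi",   "Q2", "mobile"): 680, ("Chennai", "Q1", "mobile"): 395,
    ("Chennai", "Q2", "modem"):  952, ("Mumbai",  "Q1", "mobile"): 872,
}

def roll_up(cube, keep):
    """Aggregate the cube, keeping only the dimension positions in `keep`."""
    out = defaultdict(int)
    for dims, measure in cube.items():
        out[tuple(dims[i] for i in keep)] += measure
    return dict(out)

# Drop the location dimension: total sales per (quarter, item_type).
by_quarter_item = roll_up(sales, keep=(1, 2))
print(by_quarter_item[("Q1", "mobile")])  # 605 + 395 + 872 = 1872
```

Each 2D table in the series above corresponds to fixing one location coordinate; the roll-up sums across it instead.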

Data Cube
A data cube is a multidimensional structure used to organize and analyze data. It
is also known as a multidimensional database, materialized view, or OLAP
(On-Line Analytical Processing). The main goal of a data cube is to store
precomputed, frequently queried data to enhance retrieval efficiency.

To create a data cube, specific attributes from a database are chosen as measure
attributes (values of interest, such as sales amounts) and dimensions (attributes
used to organize the data, such as time, item, branch, and location). For example,
a sales data warehouse might track sales across dimensions like time, item,
branch, and location, allowing detailed analysis such as monthly sales by item
and location. Dimension tables describe these dimensions with attributes like
item_name, brand, and type.

Data cubes are useful for various analytical applications but can be sparse,
meaning not all cells in each dimension have corresponding data. Efficient
techniques are needed to handle this sparsity. In a multidimensional data model,
data cubes allow data to be viewed and analyzed from multiple perspectives. A
central fact table, which contains numerical measures like total sales and keys to
dimension tables, anchors the model. OLAP tools leverage this structure to
enable complex queries and analyses.

Despite their benefits, data cubes can face challenges, such as efficiently using
precomputed results when queries include lower-level constants. Nonetheless,
they remain a powerful tool for data organization and analysis in data
warehousing and business intelligence.

What is Schema
A schema is a logical description of the entire database. It includes the name and
description of records of all record types, including all associated data items and
aggregates.

Much like a database, a data warehouse also requires maintaining a schema. A
database uses the relational model, while a data warehouse uses the Star,
Snowflake, or Fact Constellation schema.

Modeling data warehouses: dimensions & measures


Star Schema:

A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions.

A fact is an event that is counted or measured, such as a sale or a login. A
dimension contains reference data about the fact, such as date, item, or
customer.

A star schema is a relational schema whose design represents a
multidimensional data model.

The star schema is the simplest data warehouse schema. It is known as a star
schema because the entity-relationship diagram of this schema resembles a star,
with points diverging from a central table.

The center of the schema consists of a large fact table, and the points of the star
are the dimension tables.
Characteristics of Star Schema:

Every dimension in a star schema is represented with only one dimension table.

Each dimension table contains a set of attributes.

Each dimension table is joined to the fact table using a foreign key.

The dimension tables are not joined to each other.

The fact table contains keys and measures.

The star schema is easy to understand and provides optimal disk usage.

The dimension tables are not normalized. For instance, in the above figure,
Country_ID does not have a Country lookup table as an OLTP design would have.

The schema is widely supported by BI tools.

Advantages:

(i) Simplest and Easiest

(ii) It optimizes navigation through the database

(iii) Most suitable for Query Processing
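The characteristics above can be sketched with SQLite via Python's built-in sqlite3 module: one central fact table joined to denormalized dimension tables by foreign keys. All table names, columns, and figures here are invented for illustration:

```python
import sqlite3

# In-memory star schema: one fact table, denormalized dimension tables
# joined by foreign keys. Names and values are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY,
                           item_name TEXT, brand TEXT, type TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY,
                               city TEXT, state TEXT, country TEXT);
    CREATE TABLE fact_sales (item_key INTEGER REFERENCES dim_item,
                             location_key INTEGER REFERENCES dim_location,
                             rupees_sold INTEGER, units_sold INTEGER);
    INSERT INTO dim_item VALUES (1, 'Nova 5', 'Nova', 'mobile');
    INSERT INTO dim_location VALUES (1, 'Delhi', 'Delhi', 'India');
    INSERT INTO fact_sales VALUES (1, 1, 605, 30), (1, 1, 680, 35);
""")

# Typical star-join query: total sales per city for one item type.
row = con.execute("""
    SELECT l.city, SUM(f.rupees_sold)
    FROM fact_sales f
    JOIN dim_item i ON f.item_key = i.item_key
    JOIN dim_location l ON f.location_key = l.location_key
    WHERE i.type = 'mobile'
    GROUP BY l.city
""").fetchone()
print(row)  # ('Delhi', 1285)
```

Note how the query reaches every dimension in a single join from the fact table; that flat shape is what makes star-schema queries easy to build.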

Snowflake Schema
Definition:

● A snowflake schema is an expansion of the star schema where one or more dimension
tables do not connect directly to the fact table but join through other dimension tables.
● It is called a snowflake schema because its diagram resembles a snowflake.

Normalization:

● Snowflaking involves normalizing the dimension tables in a star schema.


● When all dimension tables are fully normalized, the structure looks like a snowflake with
the fact table in the middle.

Performance:

● Snowflaking is used to enhance the performance of specific queries.

Structure:

● The schema consists of one fact table surrounded by its associated dimensions.
● These dimensions are related to other dimensions, creating a branching snowflake
pattern.

Relationships:

● The snowflake schema has many dimension tables, which can be linked to other
dimension tables through a many-to-one relationship.

Normalization Level:

● Tables in a snowflake schema are generally normalized to the third normal form.

Hierarchy:
● Each dimension table represents exactly one level in a hierarchy.
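As a hedged sketch of this normalization (names and figures invented), the location dimension below is split into one table per hierarchy level (city, state, country), so a roll-up to country level requires extra joins:

```python
import sqlite3

# Snowflaked location dimension: each hierarchy level gets its own
# normalized table (city -> state -> country). Illustrative names only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_country (country_key INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_state (state_key INTEGER PRIMARY KEY, state TEXT,
                            country_key INTEGER REFERENCES dim_country);
    CREATE TABLE dim_city (city_key INTEGER PRIMARY KEY, city TEXT,
                           state_key INTEGER REFERENCES dim_state);
    CREATE TABLE fact_sales (city_key INTEGER REFERENCES dim_city,
                             rupees_sold INTEGER);
    INSERT INTO dim_country VALUES (1, 'India');
    INSERT INTO dim_state VALUES (1, 'Maharashtra', 1), (2, 'Delhi', 1);
    INSERT INTO dim_city VALUES (1, 'Mumbai', 1), (2, 'Delhi', 2);
    INSERT INTO fact_sales VALUES (1, 872), (2, 605);
""")

# Rolling up to country level now needs three joins instead of one.
total = con.execute("""
    SELECT co.country, SUM(f.rupees_sold)
    FROM fact_sales f
    JOIN dim_city ci ON f.city_key = ci.city_key
    JOIN dim_state st ON ci.state_key = st.state_key
    JOIN dim_country co ON st.country_key = co.country_key
    GROUP BY co.country
""").fetchone()
print(total)  # ('India', 1477)
```

Compared with the star layout, each state and country name is stored exactly once, at the cost of the longer join chain.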

Advantage of Snowflake Schema


1. The primary advantage of the snowflake schema is the improvement in
query performance due to minimized disk storage requirements and
joining smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension
levels and components.
3. No redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema


1. The primary disadvantage of the snowflake schema is the additional
maintenance effort required due to the increased number of lookup
tables.
2. Queries are more complex and hence more difficult to understand.
3. More table joins mean longer query execution time.

Difference between star and snowflake schema


○ The star schema contains fact tables and dimension tables; the snowflake
schema contains fact tables, dimension tables, and sub-dimension tables.
○ Star schema is a top-down model; snowflake schema is a bottom-up model.
○ Star schema uses more space; snowflake schema uses less space.
○ Star schema takes less time for the execution of queries; snowflake schema
takes more time.
○ In star schema, normalization is not used; in snowflake schema, both
normalization and denormalization are used.
○ The design of star schema is very simple; the design of snowflake schema
is complex.
○ The query complexity of star schema is low; the query complexity of
snowflake schema is higher.
○ Star schema is very simple to understand; snowflake schema is more
difficult to understand.
○ Star schema has fewer foreign keys; snowflake schema has more foreign
keys.
○ Star schema has high data redundancy; snowflake schema has low data
redundancy.


(Figures: star schema and snowflake schema diagrams)

Fact Constellation Schema

A Fact constellation means two or more fact tables sharing one or more
dimensions. It is also called Galaxy schema.

The Fact Constellation Schema describes a logical structure of a data warehouse
or data mart. It can be designed with a collection of denormalized fact, shared,
and conformed dimension tables.
The Fact Constellation Schema is a sophisticated design in which it is difficult to
summarize information. It can be implemented between aggregate fact tables or
by decomposing a complex fact table into independent simplex fact tables.

Example: A constellation schema is shown in the figure below.


This schema defines two fact tables, sales and shipping. Sales are treated along
four dimensions, namely, time, item, branch, and location. The schema contains a
fact table for sales that includes keys to each of the four dimensions, along with
two measures: rupees_sold and units_sold. The shipping table has five
dimensions, or keys: item_key, time_key, shipper_key, from_location, and
to_location, and two measures: rupees_cost and units_shipped.
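A minimal sketch of such a constellation, again using Python's sqlite3 with invented values: the sales and shipping fact tables share the conformed item and time dimensions, so a drill-across query can join them.

```python
import sqlite3

# Galaxy / fact constellation sketch: two fact tables (sales, shipping)
# sharing the conformed dimensions item and time. Illustrative names only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT);
    CREATE TABLE fact_sales (item_key INTEGER, time_key INTEGER,
                             rupees_sold INTEGER, units_sold INTEGER);
    CREATE TABLE fact_shipping (item_key INTEGER, time_key INTEGER,
                                shipper_key INTEGER,
                                rupees_cost INTEGER, units_shipped INTEGER);
    INSERT INTO dim_item VALUES (1, 'modem');
    INSERT INTO dim_time VALUES (1, 'Q1');
    INSERT INTO fact_sales VALUES (1, 1, 825, 40);
    INSERT INTO fact_shipping VALUES (1, 1, 7, 60, 40);
""")

# Drill-across query: combine both fact tables through the shared dimensions.
row = con.execute("""
    SELECT i.item_name, s.units_sold, sh.units_shipped
    FROM fact_sales s
    JOIN fact_shipping sh ON s.item_key = sh.item_key
                         AND s.time_key = sh.time_key
    JOIN dim_item i ON s.item_key = i.item_key
""").fetchone()
print(row)  # ('modem', 40, 40)
```

Because both stars hang off the same dimension keys, a single query can compare what was sold against what was shipped.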

The primary disadvantage of the fact constellation schema is that it is a more
challenging design, because many variants for specific kinds of aggregation must
be considered and selected.

Concept Hierarchy

A concept hierarchy is a directed acyclic graph of concepts in which each
concept is identified by a unique name. In this hierarchy, an arc from concept A
to concept B indicates that A is a more general concept than B. Reports are
tagged with concepts that correspond to their content, and tagging a report with
a concept also implicitly tags it with all the ancestors of that concept in the
hierarchy. Therefore, reports should be tagged with the lowest (most specific)
applicable concept to ensure accuracy.

The process of tagging reports to the hierarchy uses a top-down approach. An
evaluation function determines whether a record tagged to a node can also be
tagged to any of its child nodes, and the tag moves down the hierarchy until it
cannot be pushed further. This results in a hierarchy of reports, where each node
contains a set of reports related to a common concept. This hierarchy is useful
for various text mining processes.

The concept hierarchy, assumed to be predefined (a priori), can also be created for
documents using hierarchical clustering algorithms. These hierarchies map specific,
low-level concepts to more general, high-level concepts and are used in data
warehouses to express different levels of granularity of an attribute. They are crucial for
formulating useful OLAP queries, allowing users to summarize data at various levels.
For example, using a location hierarchy, users can retrieve and summarize sales data
for each location, area, state, or country without needing to reorganize the data.
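The tagging and roll-up idea can be sketched in a few lines of Python; the child-to-parent edges and sales figures below are invented for illustration:

```python
# Minimal concept hierarchy for the location attribute: child -> parent
# edges (city < state < country). Place names are illustrative.
parent = {
    "Delhi": "Delhi (NCT)", "Mumbai": "Maharashtra", "Chennai": "Tamil Nadu",
    "Delhi (NCT)": "India", "Maharashtra": "India", "Tamil Nadu": "India",
}

def ancestors(concept):
    """All more-general concepts implied by tagging with `concept`."""
    chain = []
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain

def roll_up(sales_by_city, level_of):
    """Summarize city-level sales at whatever level `level_of` maps to."""
    totals = {}
    for city, amount in sales_by_city.items():
        key = level_of(city)
        totals[key] = totals.get(key, 0) + amount
    return totals

sales = {"Delhi": 605, "Mumbai": 872, "Chennai": 395}
print(ancestors("Mumbai"))                         # ['Maharashtra', 'India']
print(roll_up(sales, lambda c: ancestors(c)[-1]))  # {'India': 1872}
```

Tagging "Mumbai" implicitly tags "Maharashtra" and "India", which is exactly what lets an OLAP query summarize city-level data at the state or country level without reorganizing it.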
Three-Tier Data Warehouse Architecture
Data Warehouses usually have a three-level (tier) architecture that includes:

○ Bottom Tier (Data Warehouse Server)
○ Middle Tier (OLAP Server)
○ Top Tier (Front-end Tools)

Bottom Tier

A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It
may include several specialized data marts and a metadata repository.

Data from operational databases and external sources (such as user profile data
provided by external consultants) are extracted using application program
interfaces called gateways. A gateway is provided by the underlying DBMS and
allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB
(Object Linking and Embedding for Databases) by Microsoft, and JDBC (Java
Database Connectivity).

Middle-tier

A middle-tier which consists of an OLAP server for fast querying of the data warehouse.

The OLAP server is implemented using either

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that
directly implements multidimensional data and operations.

Top-tier

A top-tier that contains front-end tools for displaying results provided by OLAP, as well as
additional tools for data mining of the OLAP-generated data.

Data Marting

A Data Mart is a subset of an organizational data store, generally oriented to a
specific purpose or primary data subject, which may be distributed to support
business needs. Data marts are analytical record stores designed to focus on
particular business functions for a specific community within an organization.
Data marts are derived from subsets of data in a data warehouse, though in the
bottom-up data warehouse design methodology, the data warehouse is created
from the union of organizational data marts.

The fundamental use of a data mart is Business Intelligence (BI) applications. BI
is used to gather, store, access, and analyze records. A data mart can be used by
smaller businesses to utilize the data they have accumulated, since it is less
expensive than implementing a data warehouse.

Reasons for creating a data mart


○ Creates collective data for a group of users
○ Easy access to frequently needed data
○ Ease of creation
○ Improves end-user response time
○ Lower cost than implementing a complete data warehouse
○ Potential clients are more clearly defined than in a comprehensive data
warehouse
○ It contains only essential business data and is less cluttered.

Types of Data Marts

There are mainly two approaches to designing data marts. These approaches are

○ Dependent Data Marts


○ Independent Data Marts

Dependent Data Marts


A dependent data mart is a logical or physical subset of a larger data warehouse.
In this technique, the data marts are treated as subsets of a data warehouse:
first a data warehouse is created, from which various data marts can then be
created. These data marts depend on the data warehouse and extract the
essential records from it. Because the data warehouse creates the data marts,
there is no need for data mart integration. This is also known as a top-down
approach.

Independent Data Marts


In the second approach, independent data marts (IDM) are created first, and
then a data warehouse is designed using these multiple independent data marts.
Because all the data marts are designed independently, integration of the data
marts is required. This is also termed a bottom-up approach, as the data marts
are integrated to develop a data warehouse.

Hybrid Data Marts


A hybrid data mart combines input from sources other than a data warehouse.
This can be helpful in many situations, especially when ad hoc integrations are
needed, such as after a new group or product is added to the organization.

Steps in Implementing a Data Mart


The significant steps in implementing a data mart are to design the schema,
construct the physical storage, populate the data mart with data from source
systems, access it to make informed decisions and manage it over time. So, the
steps are:

Designing
The design step is the first in the data mart process. This phase covers all of the
functions from initiating the request for a data mart through gathering data
about the requirements and developing the logical and physical design of the
data mart.

It involves the following tasks:


1. Gathering the business and technical requirements
2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data mart.

Constructing
This step contains creating the physical database and logical structures
associated with the data mart to provide fast and efficient access to the data.

It involves the following tasks:

1. Creating the physical database and logical structures, such as tablespaces,
associated with the data mart.
2. Creating the schema objects, such as tables and indexes, described in the
design step.
3. Determining how best to set up the tables and access structures.

Populating
This step includes all of the tasks related to getting data from the source,
cleaning it up, modifying it to the right format and level of detail, and moving it
into the data mart.

It involves the following tasks:

1. Mapping data sources to target data sources


2. Extracting data
3. Cleansing and transforming the information.
4. Loading data into the data mart
5. Creating and storing metadata
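The populating tasks above can be sketched as a toy extract-transform-load pipeline in Python; the source rows, field names, and cleansing rules are invented for illustration:

```python
import datetime

def extract():
    # Stand-in for reading rows from a hypothetical operational source.
    return [
        {"item": " Modem ", "city": "delhi",  "amount": "825"},
        {"item": "Mobile",  "city": "MUMBAI", "amount": "872"},
        {"item": "Mobile",  "city": "mumbai", "amount": None},  # bad row
    ]

def transform(rows):
    # Cleansing: trim whitespace, normalize case, convert the measure to
    # the right type, and drop rows with missing measures.
    clean = []
    for r in rows:
        if r["amount"] is None:
            continue
        clean.append({"item": r["item"].strip().lower(),
                      "city": r["city"].strip().title(),
                      "rupees_sold": int(r["amount"])})
    return clean

def load(rows, mart):
    # Move the cleansed rows into the mart and record load metadata.
    mart["rows"].extend(rows)
    mart["metadata"] = {"loaded_at": datetime.datetime.now().isoformat(),
                        "row_count": len(mart["rows"])}

mart = {"rows": [], "metadata": {}}
load(transform(extract()), mart)
print(mart["metadata"]["row_count"])  # 2 (the bad row was dropped)
```

A real populating step would map many sources to target structures and persist the metadata alongside the mart, but the extract/transform/load split shown here is the same.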

Accessing
This step involves putting the data to use: querying the data, analyzing it,
creating reports, charts and graphs and publishing them.

It involves the following tasks:

1. Set up an intermediate layer (meta layer) for the front-end tool to use. This
layer translates database operations and object names into business terms
so that end users can interact with the data mart using words that relate to
business functions.
2. Set up and manage database structures, such as summarized tables, that
help queries submitted through the front-end tools execute rapidly and
efficiently.

Managing
This step contains managing the data mart over its lifetime. In this step,
management functions are performed as:

1. Providing secure access to the data.


2. Managing the growth of the data.
3. Optimizing the system for better performance.
4. Ensuring the availability of data even during system failures.

Difference between Data Warehouse and Data Mart

○ A data warehouse is a vast repository of information collected from various
organizations or departments within a corporation; a data mart is only a
subtype of a data warehouse, architected to meet the requirements of a
specific user group.
○ A data warehouse may hold multiple subject areas; a data mart holds only
one subject area, for example, Finance or Sales.
○ A data warehouse holds very detailed information; a data mart may hold
more summarized data.
○ A data warehouse works to integrate all data sources; a data mart
concentrates on integrating data from a given subject area or set of source
systems.
○ In data warehousing, the fact constellation schema is used; in a data mart,
star and snowflake schemas are used.
○ A data warehouse is a centralized system; a data mart is a decentralized
system.
○ Data warehousing is data-oriented; data marts are project-oriented.

Data Warehouse Process Architecture

The process architecture defines an architecture in which the data from the data
warehouse is processed for a particular computation.

Following are the two fundamental process architectures:

Centralized Process Architecture


In this architecture, the data is collected into single centralized storage and
processed upon completion by a single machine with a huge structure in terms
of memory, processor, and storage.

Centralized process architecture evolved with transaction processing and is well
suited for small organizations with one location of service.

It requires minimal resources from both people and system perspectives.

It is very successful when the collection and consumption of data occur at the
same location.

Distributed Process Architecture


In this architecture, information and its processing are distributed across data
centers, with processing localized near the data and the results grouped into
centralized storage. Distributed architectures are used to overcome the
limitations of centralized process architectures, where all the information must
be collected in one central location and results are available in one central
location.

There are several architectures of the distributed process:

Client-Server

In this architecture, the client handles information collection and presentation,
while the server does the processing and management of data.
Three-tier Architecture

With client-server architecture, the client machines must stay connected to a
server machine, introducing latencies and overhead in terms of records carried
between clients and servers. A three-tier architecture addresses this by placing
a middle tier between the clients and the server.

N-tier Architecture

The n-tier or multi-tier architecture is one where clients, middleware,
applications, and servers are isolated into tiers.

Cluster Architecture

In this architecture, machines connected in a network (via software or hardware
clustering) work together to process information or compute requirements in
parallel. Each device in a cluster is assigned a function that is processed locally,
and the result sets are collected by a master server that returns them to the
user.

Peer-to-Peer Architecture

This is a type of architecture where there are no dedicated servers and clients.
Instead, all the processing responsibilities are allocated among all machines,
called peers. Each machine can perform the function of a client or server or just
process data.
