Data Warehousing

Data Warehouse (DW)


• A data warehouse is a centralized repository for storing and managing large
amounts of data from various sources for analysis and reporting.
• It includes historical data derived from transaction data from single and
multiple sources.
• A Data Warehouse provides integrated, enterprise-wide, historical data and
focuses on providing support for decision-makers for data modeling and
analysis.
• "Data Warehouse is a subject-oriented, integrated, and time-variant store
of information in support of management's decisions."
Characteristics of Data Warehouse

Subject Oriented
• A data warehouse is also
subject-oriented, which means
that the data is organized around
specific subjects.
• This allows for easy access to the
data relevant to a specific
subject, as well as the ability to
track the data over time.
• Examples: sales information,
customer information.
Integrated
Integrated data means a data
warehouse stores data from
multiple sources by standardizing
and formatting all data into a
single, consistent format to
support accurate reporting and
analysis.
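The standardization step described above can be sketched as follows. This is a minimal illustration, assuming two hypothetical source systems (a CRM and a billing system) that record the same customer fields in different formats; the field names, date formats, and records are invented for the example.

```python
from datetime import datetime

# Hypothetical records from two source systems with inconsistent formats.
crm_rows = [{"cust": "Asha Rao", "gender": "F", "joined": "2023-01-15"}]
billing_rows = [{"customer_name": "RAVI KUMAR", "sex": "male", "signup": "15/02/2023"}]

# One agreed set of gender codes for the warehouse (an assumption for this sketch).
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def from_crm(row):
    """Map a CRM record onto the single warehouse format."""
    return {
        "customer_name": row["cust"].title(),
        "gender": GENDER_CODES[row["gender"].lower()],
        "joined_on": datetime.strptime(row["joined"], "%Y-%m-%d").date().isoformat(),
    }


def from_billing(row):
    """Map a billing record onto the same warehouse format."""
    return {
        "customer_name": row["customer_name"].title(),
        "gender": GENDER_CODES[row["sex"].lower()],
        "joined_on": datetime.strptime(row["signup"], "%d/%m/%Y").date().isoformat(),
    }


# After integration, every row has the same keys, codes, and date format.
warehouse_rows = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

Each source gets its own mapping function, so adding a third source would not change how existing sources are handled.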

Time Variant
Historical information is kept in a
data warehouse. For example, one
can retrieve data from 3 months, 6
months, 12 months, or even
earlier from a data warehouse.
This contrasts with a transaction
system, where often only the
most current data is kept.
Non-volatile
Another characteristic of a data
warehouse is that it is non-
volatile. This means that the data
in the warehouse is never updated
or deleted, only added to. This is
important because it allows for
the preservation of historical data,
making it possible to track trends
and patterns over time.
Example Applications of Data
Warehousing
• Data warehousing can be applied wherever there is a large amount of
data and statistical results are needed to support decision making.
• Social Media Websites: Social networking websites such as Facebook,
Twitter, and LinkedIn depend on analyzing large data sets. These sites
gather data related to members, groups, locations, etc., and store it in a
single central repository. Because of the volume of data, a data warehouse
is needed to implement this.
• Banking: Many banks use data warehouses to study the
spending patterns of account holders and cardholders. They use the results
to provide customers with special offers, deals, etc.
• Government: Governments use data warehouses to store and analyze tax
payments in order to detect tax evasion.
Need for Data Warehouse

1) Business users: Business users require a data warehouse to view summarized
data from the past. Since these users are often non-technical, the data may be
presented to them in a simple form.
2) Store historical data: A data warehouse is required to store time-variant
data from the past, which can then be used for various purposes.
3) Make strategic decisions: Some strategies depend on the data
in the data warehouse, so the data warehouse contributes to making strategic
decisions.
4) Data consistency and quality: By bringing data from different sources to
a common place, the user can effectively bring uniformity and
consistency to the data.
5) Quick response time: A data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant degree of
flexibility and quick response time.
Advantages:

• Improved data quality: Data warehousing can help improve data quality
by consolidating data from various sources into a single, consistent view.
• Faster access to information: Data warehousing enables quick access to
information, allowing businesses to make better, more informed
decisions faster.
• Better decision-making: With a data warehouse, businesses can analyze
data and gain insights into trends and patterns that can inform better
decision-making.
• Reduced data redundancy: By consolidating data from various sources,
data warehousing can reduce data redundancy and inconsistencies.
• Scalability: Data warehousing is highly scalable and can handle large
amounts of data from different sources.
Disadvantages:

• Cost: Building a data warehouse can be expensive, requiring significant
investments in hardware, software, and personnel.
• Complexity: Data warehousing can be complex, and businesses may
need to hire specialized personnel to manage the system.
• Time-consuming: Building a data warehouse can take a significant
amount of time, requiring businesses to be patient and committed to
the process.
• Data integration challenges: Data from different sources can be
challenging to integrate, requiring significant effort to ensure
consistency and accuracy.
• Data security: Data warehousing can pose data security risks, and
businesses must take measures to protect sensitive data from
unauthorized access or breaches.
Difference between Database and Data
warehouse
Types of Data Warehouse

Three main types of Data Warehouses (DWH) are:

1. Enterprise Data Warehouse (EDW):

Enterprise Data Warehouse (EDW) is a centralized warehouse. It provides decision support services across
the enterprise. It offers a unified approach for organizing and representing data. It also provides the ability to
classify data according to subject and to grant access according to those divisions.

2. Operational Data Store:

An Operational Data Store (ODS) is a data store used when neither the data
warehouse nor the OLTP systems support an organization's reporting needs. An ODS is refreshed in real
time, so it is widely preferred for routine activities such as storing employee records.

3. Data Mart:

A data mart is a subset of the data warehouse. It is specially designed for a particular line of business,
such as sales or finance. In an independent data mart, data can be collected directly from sources.
Data Warehouse Architecture: Basic
Operational System
In data warehousing, an operational system is a system
that processes the day-to-day transactions of an
organization.
Flat Files
A flat file system is a system of files in which
transactional data is stored, and every file in the system
must have a different name.
Meta Data
• A set of data that defines and gives information about
other data.
• Metadata summarizes basic information about
data.
• For example, author, date created, date modified, and
file size are examples of very basic document metadata.
Lightly and highly summarized data
This area of the data warehouse stores all the
predefined lightly and highly summarized (aggregated) data
generated by the warehouse manager.
End-User Access Tools
The principal purpose of a data warehouse is to
provide information to business managers for strategic
decision-making. These users interact with the
warehouse using end-user access tools.
Examples of end-user access tools include:
• Reporting and Query Tools
• Application Development Tools
• Executive Information Systems Tools
• Online Analytical Processing Tools
• Data Mining Tools
Data Warehouse Architecture: With Staging Area
• We must clean and process operational
data before putting it into the warehouse.
• We can do this programmatically, although
most data warehouses use a staging area (a place
where data is processed before entering the
warehouse).
• A staging area simplifies data cleansing and
consolidation for operational data coming
from multiple source systems, especially for
enterprise data warehouses where all relevant
data of an enterprise is consolidated.
• The data warehouse staging area is a temporary
location to which records from source systems
are copied.
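The flow through a staging area can be sketched as follows. This is a minimal illustration under stated assumptions: the source names, record fields, and cleaning rules are hypothetical, and a real staging area would normally be a set of database tables or files, not in-memory lists.

```python
# Hypothetical raw records copied unchanged from two source systems.
staging_area = [
    {"source": "store_pos", "order_id": "1001", "amount": " 250.00 "},
    {"source": "web_shop", "order_id": "1001", "amount": "250.00"},  # duplicate order
    {"source": "web_shop", "order_id": "1002", "amount": "n/a"},     # unusable amount
    {"source": "store_pos", "order_id": "1003", "amount": "99.50"},
]


def clean(record):
    """Return a cleaned record, or None if the amount cannot be parsed."""
    try:
        amount = float(record["amount"].strip())
    except ValueError:
        return None
    return {"order_id": int(record["order_id"]), "amount": amount}


warehouse = {}
for raw in staging_area:
    row = clean(raw)
    if row is not None:
        # Consolidate: keep a single row per order_id across all sources.
        warehouse.setdefault(row["order_id"], row)

# The staging area is temporary, so it is emptied once the load is done.
staging_area.clear()
```

Only cleaned, de-duplicated rows reach `warehouse`; rejected records never leave the staging step.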
Data Warehouse Architecture: With Staging Area and
Data Marts
We may want to customize our warehouse's
architecture for multiple groups within our
organization.
• We can do this by adding data marts. A data
mart is a segment of a data warehouse that
provides information for reporting and
analysis on a section, unit, department, or
operation in the company, e.g., sales, payroll,
production, etc.
• The figure illustrates an example where
purchasing, sales, and stocks are separated. In
this example, a financial analyst wants to
analyze historical data for purchases and sales
or mine historical information to make
predictions about customer behavior.
Properties of Data Warehouse Architectures
• 1. Separation: Analytical and transactional processing should be kept
apart as much as possible.
• 2. Scalability: Hardware and software architectures should be simple to
upgrade as the data volume to be managed and processed, and
the number of user requirements to be met, progressively
increase.
• 3. Extensibility: The architecture should be able to accommodate new
operations and technologies without redesigning the whole system.
• 4. Security: Monitoring access is necessary because of the strategic
data stored in the data warehouse.
• 5. Administerability: Data warehouse management should not be
complicated.
Dimensional Modeling

• Dimensional modeling represents data as a cube, making
the logical data representation more suitable for OLAP data management.
• The concept of dimensional modeling was developed by Ralph
Kimball and consists of "fact" and "dimension" tables.
• In dimensional modeling, the transaction record is divided into
either "facts," which are frequently numerical transaction data,
or "dimensions," which are the reference information that gives context
to the facts.
• For example, a sale transaction can be broken down into facts such as the
number of products ordered and the price paid for the products, and into
dimensions such as order date, user name, product number, order ship-to
and bill-to locations, and the salesperson responsible for receiving the order.
Objectives of Dimensional Modeling

• The purposes of dimensional modeling are:
• To produce a database architecture that is easy for end users to
understand and write queries against.
• To maximize the efficiency of queries. It achieves these goals by
minimizing the number of tables and relationships between them.
Elements of Dimensional Modeling

Fact
It is a collection of associated data items, consisting of measures and context data. It typically
represents business items or business transactions.
Dimensions
It is a collection of data that describes one business dimension. Dimensions provide the
contextual background for the facts, and they are the framework over which OLAP is performed.
Measure
It is a numeric attribute of a fact, representing the performance or behavior of the business
relative to the dimensions.
Fact Table
Fact tables are used to store facts or measures of the business. Facts are the numeric data
elements that are of interest to the company.
Dimension Table
Dimension tables establish the context of the facts. Dimensional tables store fields that describe
the facts.
Example of Fact and Dimension Table
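A minimal sketch of one fact table and two dimension tables, written as plain Python structures. The products, dates, and amounts are hypothetical and chosen only to show how measures in the fact table get their context from dimension attributes.

```python
# Dimension tables: descriptive context, keyed by an id.
product_dim = {
    1: {"product_name": "Laptop", "category": "Electronics"},
    2: {"product_name": "Desk", "category": "Furniture"},
}
date_dim = {
    20240105: {"month": "January", "year": 2024},
    20240210: {"month": "February", "year": 2024},
}

# Fact table: foreign keys into the dimensions plus numeric measures.
sales_fact = [
    {"product_id": 1, "date_id": 20240105, "units_sold": 3, "revenue": 2400.0},
    {"product_id": 2, "date_id": 20240105, "units_sold": 1, "revenue": 150.0},
    {"product_id": 1, "date_id": 20240210, "units_sold": 2, "revenue": 1600.0},
]

# Use a dimension attribute (category) to give context to a measure (revenue).
revenue_by_category = {}
for fact in sales_fact:
    category = product_dim[fact["product_id"]]["category"]
    revenue_by_category[category] = revenue_by_category.get(category, 0.0) + fact["revenue"]
```

The fact rows carry only keys and numbers; everything descriptive (names, categories, months) lives in the dimension tables.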
Data Cube

• Data grouped or combined in multidimensional matrices is
called a data cube. The data cube method has a few alternative names
or variants, such as "multidimensional databases,"
"materialized views," and "OLAP (On-Line Analytical Processing)."
Multi-Dimensional Data Model

• A multidimensional model views data in the form of a data cube. A
data cube enables data to be modeled and viewed in multiple
dimensions. It is defined by dimensions and facts.
• For example, a shop may create a sales data warehouse to keep
records of the store's sales for the dimensions time, item, and location.
These dimensions allow the store to keep track of things such as
monthly sales of items and the locations at which the items were
sold. Each dimension has a table related to it, called a dimension
table, which describes the dimension further. For example, a
dimension table for item may contain the attributes item_name,
brand, and type.
• A multidimensional data model
is organized around a central
theme, for example, sales. This
theme is represented by a fact
table. Facts are numerical
measures. The fact table
contains the names of the facts,
or measures, as well as keys to
each of the related dimension
tables.
• Consider the data of a shop for items sold per quarter in the city of
Delhi. The data is shown in the table. In this 2D representation, the
sales for Delhi are shown for the time dimension (organized in
quarters) and the item dimension (classified according to the types of
an item sold). The fact or measure displayed is rupee_sold (in
thousands).
Star Schema

• A star schema is the elementary form of a dimensional model, in
which data are organized into facts and dimensions.
• A star schema is a relational schema
whose design represents a multidimensional data model.
• The star schema is the simplest data warehouse schema.
• It is known as a star schema because its entity-relationship diagram
resembles a star, with points diverging from a central
table.
Fact Tables
• A fact table is a table in a star schema that contains facts and is connected to
dimensions. A fact table has two types of columns: those that contain
facts and those that are foreign keys to the dimension tables. The primary
key of a fact table is generally a composite key made up of all
of its foreign keys.
Dimension Tables
• Fact tables store data about sales, while dimension tables store data about
the geographic region (markets, cities), clients, products, times, and
channels.
• The primary key of each dimension table is part of the
composite primary key of the fact table.
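The fact and dimension tables described above can be sketched with Python's built-in `sqlite3` module. The table names, columns, and rows are hypothetical, loosely following the time, item, and location dimensions used earlier in the deck; this is one possible layout, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Fact table: foreign keys to every dimension plus the measures.
-- Its primary key is the composite of all its foreign keys.
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    item_id INTEGER REFERENCES dim_item(item_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    units_sold INTEGER,
    rupees_sold REAL,
    PRIMARY KEY (time_id, item_id, location_id)
);

INSERT INTO dim_time VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO dim_item VALUES (1, 'Phone X', 'Acme', 'phone'), (2, 'Laptop Y', 'Acme', 'computer');
INSERT INTO dim_location VALUES (1, 'Delhi', 'India'), (2, 'Mumbai', 'India');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 10, 500.0), (1, 2, 1, 4, 800.0), (2, 1, 1, 12, 600.0), (2, 1, 2, 7, 350.0);
""")

# Every dimension joins directly to the fact table: one join per dimension used.
rows = conn.execute("""
    SELECT t.quarter, SUM(f.rupees_sold)
    FROM fact_sales AS f
    JOIN dim_time AS t ON f.time_id = t.time_id
    JOIN dim_location AS l ON f.location_id = l.location_id
    WHERE l.city = 'Delhi'
    GROUP BY t.quarter
    ORDER BY t.quarter
""").fetchall()
```

Here `rows` holds the quarterly sales total for Delhi. No dimension table joins to another dimension table, which is what gives the schema its star shape.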
Advantages of Star Schema

Query Performance
• Because a star schema database has a limited number of tables and clear join paths, queries run faster.
• Large join queries that involve multiple tables take only seconds or minutes to run.
Load Performance and Administration
• Describing facts and dimensions and separating them into different tables reduces the impact of a load
operation.
• Dimension tables can be populated once and occasionally refreshed. We can add new facts regularly
and selectively by appending records to a fact table.
Built-in Referential Integrity
• A star schema has referential integrity built in when information is loaded.
• Referential integrity is enforced because each record in a dimension table has a unique primary key, and
all keys in the fact table are legitimate foreign keys drawn from the dimension tables.
Easily Understood
• A star schema is simple to understand and navigate, with dimensions joined only through the fact
table.
• Users can also browse dimension table attributes before constructing a query.
Disadvantage of Star Schema

Some conditions cannot be met by a star schema. For example, the
relationship between a user and a bank account cannot be described as a
star schema because the relationship between them is many-to-many.
Example: Suppose a star schema is composed of a fact table, SALES,
and several dimension tables connected to it for time, branch, item,
and geographic locations.
Snowflake Schema

• A snowflake schema is a variant of the star schema. "A schema is
known as a snowflake if one or more dimension tables do not connect
directly to the fact table but must join through other dimension
tables."
• Snowflaking is a method of normalizing the dimension tables in a
star schema.
• Snowflaking is used to improve the performance of specific queries.
• The snowflake schema consists of one fact table linked to
many dimension tables, which can be linked to other dimension
tables through a many-to-one relationship.
• The following diagram shows a snowflake schema with two dimensions, each having three levels.
A snowflake schema can have any number of dimensions, and each dimension can have any
number of levels.
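A minimal sketch of a snowflaked dimension, using Python's built-in `sqlite3` module. It assumes a hypothetical location dimension normalized into three levels (location, city, country); all table names and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location dimension is normalized into three levels.
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE dim_city (
    city_id INTEGER PRIMARY KEY,
    city_name TEXT,
    country_id INTEGER REFERENCES dim_country(country_id)
);
CREATE TABLE dim_location (
    location_id INTEGER PRIMARY KEY,
    street TEXT,
    city_id INTEGER REFERENCES dim_city(city_id)
);
-- Only the lowest level connects directly to the fact table.
CREATE TABLE fact_sales (
    location_id INTEGER REFERENCES dim_location(location_id),
    units_sold INTEGER
);

INSERT INTO dim_country VALUES (1, 'India'), (2, 'Nepal');
INSERT INTO dim_city VALUES (1, 'Delhi', 1), (2, 'Mumbai', 1), (3, 'Kathmandu', 2);
INSERT INTO dim_location VALUES (1, 'MG Road', 1), (2, 'Marine Drive', 2), (3, 'Durbar Marg', 3);
INSERT INTO fact_sales VALUES (1, 10), (2, 5), (3, 7), (1, 3);
""")

# Reaching the country level takes three joins instead of one.
rows = conn.execute("""
    SELECT co.country_name, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_location AS l ON f.location_id = l.location_id
    JOIN dim_city AS ci ON l.city_id = ci.city_id
    JOIN dim_country AS co ON ci.country_id = co.country_id
    GROUP BY co.country_name
    ORDER BY co.country_name
""").fetchall()
```

Each country name is stored once, which removes redundancy, but the query has to walk the chain of dimension tables to use it.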
Advantage of Snowflake Schema

• The primary advantage of the snowflake schema is the improvement
in query performance due to minimized disk storage requirements
and joins against smaller lookup tables.
• It provides greater scalability in the interrelationship between
dimension levels and components.
• Less redundancy, so it is easier to maintain.
Disadvantage of Snowflake Schema

• The primary disadvantage of the snowflake schema is the additional
maintenance effort required due to the increasing number of lookup
tables.
• Queries are more complex and hence more difficult to understand.
• More tables mean more joins, and so longer query execution time.
Key Differences Between Star and Snowflake Schema

• A star schema contains just one dimension table for each dimension, while a snowflake schema
may have a dimension table and sub-dimension tables for one dimension.
• Normalization is used in the snowflake schema, which eliminates data redundancy. By
contrast, normalization is not performed in the star schema, which results in data
redundancy.
• The star schema is simple, easy to understand, and involves less intricate queries. By
contrast, the snowflake schema is harder to understand and involves complex queries.
• The data model approach used in a star schema is top-down, whereas the snowflake
schema uses bottom-up.
• The star schema uses fewer joins, while the snowflake schema uses a
large number of joins.
• The star schema consumes more space than the snowflake schema.
• Executing a query in a star schema takes less time, while the snowflake
schema takes more time because of the larger number of joins.
OLAP (Online Analytical Processing)

• OLAP stands for On-Line Analytical Processing.
• OLAP is a category of software technology that enables
analysts, managers, and executives to gain insight into information
through fast, consistent, interactive access to a wide variety of
possible views of data that has been transformed from raw
information to reflect the real dimensionality of the enterprise as
understood by the users.
Who uses OLAP and Why?

OLAP applications are used by a variety of the functions of an organization.


Finance and accounting:
• Budgeting
• Activity-based costing
• Financial performance analysis
• Financial modeling
Sales and Marketing
• Sales analysis and forecasting
• Market research analysis
• Promotion analysis
• Customer analysis
• Market and customer segmentation
Production
• Production planning
• Defect analysis
Characteristics of OLAP
• Multidimensional conceptual view: OLAP systems let business users have a
dimensional and logical view of the data in the data warehouse. It helps in
carrying out slice and dice operations.
• Multi-user support: Since OLAP systems are shared, the OLAP
system should provide normal database operations, including retrieval,
update, concurrency control, integrity, and security.
• Accessibility: OLAP acts as a mediator between data warehouses and the front
end. The OLAP layer should sit between data sources (e.g., data
warehouses) and an OLAP front end.
• Storing OLAP results: OLAP results are kept separate from data sources.
• Uniform reporting performance: Increasing the number of dimensions or
database size should not significantly degrade the reporting performance of
the OLAP system.
OLAP Operations in the Multidimensional Data Model

• In the multidimensional model, the records are organized into various
dimensions, and each dimension includes multiple levels of
abstraction described by concept hierarchies.
• This organization gives users the flexibility to view data from
various perspectives.
• A number of OLAP data cube operations exist to materialize these
different views, allowing interactive querying and analysis of the data
at hand. Hence, OLAP provides a user-friendly environment for
interactive data analysis.
Roll-Up

• The roll-up operation (also known as drill-up or aggregation
operation) performs aggregation on a data cube, either by climbing up
a concept hierarchy or by dimension reduction. Roll-up is like zooming
out on the data cube.
• When a roll-up is performed by dimension reduction, one or more
dimensions are removed from the cube.
• For example, consider a sales data cube having two dimensions,
location and time. Roll-up may be performed by removing the time
dimension, resulting in an aggregation of the total sales by location
rather than by location and by time.
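Roll-up can be sketched in two ways: by removing a dimension, as in the location and time example, or by moving to a coarser level of a concept hierarchy. The cube values and the city-to-country hierarchy below are hypothetical.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, quarter) -> sales.
cube = {
    ("Delhi", "Q1"): 100, ("Delhi", "Q2"): 120,
    ("Mumbai", "Q1"): 80, ("Mumbai", "Q2"): 90,
    ("Toronto", "Q1"): 70, ("Toronto", "Q2"): 60,
}
# Assumed concept hierarchy for the location dimension: city -> country.
city_to_country = {"Delhi": "India", "Mumbai": "India", "Toronto": "Canada"}


def roll_up_by_dimension_reduction(cells):
    """Remove the time dimension: total sales by location only."""
    totals = defaultdict(int)
    for (city, _quarter), sales in cells.items():
        totals[city] += sales
    return dict(totals)


def roll_up_hierarchy(cells, hierarchy):
    """Aggregate cities into countries while keeping the time dimension."""
    totals = defaultdict(int)
    for (city, quarter), sales in cells.items():
        totals[(hierarchy[city], quarter)] += sales
    return dict(totals)


by_city = roll_up_by_dimension_reduction(cube)
by_country = roll_up_hierarchy(cube, city_to_country)
```

Both results have fewer cells than the original cube: the first drops a dimension entirely, the second keeps both dimensions but at a coarser level of location.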
Drill-Down

• The drill-down operation (also called roll-down) is the reverse
operation of roll-up.
• Drill-down is like zooming in on the data cube. It navigates from less
detailed data to more detailed data.
• Drill-down can be performed by either stepping down a concept
hierarchy for a dimension or adding additional dimensions.
• The figure shows a drill-down operation performed on the dimension time
by stepping down a concept hierarchy defined as day, month,
quarter, and year. Drill-down occurs by descending the time hierarchy
from the level of the quarter to the more detailed level of the month.
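A minimal sketch of drilling down from quarter to month. It assumes the detailed month-level data is still available, since drill-down cannot recover detail that was never stored; the sales figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical detailed records at the month level.
monthly_sales = {"Jan": 30, "Feb": 40, "Mar": 50, "Apr": 20, "May": 25, "Jun": 35}
month_to_quarter = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

# The summarized view the analyst starts from.
quarterly_sales = defaultdict(int)
for month, sales in monthly_sales.items():
    quarterly_sales[month_to_quarter[month]] += sales


def drill_down(quarter):
    """Step down the time hierarchy: show the months that make up a quarter."""
    return {m: s for m, s in monthly_sales.items() if month_to_quarter[m] == quarter}


q1_detail = drill_down("Q1")
```

The months returned for a quarter always sum back to that quarter's total, which is what makes drill-down the reverse of roll-up.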
Slice

• A slice is a subset of the cube corresponding to a single value for one
or more members of a dimension.
• For example, a slice operation is executed when the user wants a
selection on one dimension of a three-dimensional cube, resulting in a
two-dimensional slice. So, the slice operation performs a selection on
one dimension of the given cube, thus resulting in a subcube.
Dice

• The dice operation defines a subcube by performing a selection on
two or more dimensions.
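Slice and dice can be sketched side by side on a small three-dimensional cube. The cube contents are hypothetical, and a dictionary keyed by (location, quarter, item) stands in for a real cube structure.

```python
# Hypothetical 3-D cube cells: (location, quarter, item) -> sales.
cube = {
    ("Delhi", "Q1", "phone"): 10, ("Delhi", "Q2", "phone"): 12,
    ("Delhi", "Q1", "laptop"): 4, ("Mumbai", "Q1", "phone"): 7,
    ("Mumbai", "Q2", "laptop"): 5, ("Mumbai", "Q3", "phone"): 9,
}


def slice_cube(cells, quarter):
    """Fix the time dimension to one value; the result is two-dimensional."""
    return {(loc, item): v for (loc, q, item), v in cells.items() if q == quarter}


def dice_cube(cells, locations, quarters):
    """Select on two dimensions at once; the result is a smaller 3-D subcube."""
    return {k: v for k, v in cells.items() if k[0] in locations and k[1] in quarters}


q1_slice = slice_cube(cube, "Q1")
delhi_first_half = dice_cube(cube, {"Delhi"}, {"Q1", "Q2"})
```

The slice loses the time dimension because it is fixed to a single value, while the dice keeps all three dimensions and only narrows the range on two of them.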
Pivot

• The pivot operation is also called rotation.
• Pivot is a visualization operation that rotates the data axes in view
to provide an alternative presentation of the data.
• It may involve swapping the rows and columns or moving one of the
row dimensions into the column dimensions.
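A minimal sketch of a pivot that swaps rows and columns in a two-dimensional view. The labels and values are hypothetical.

```python
# Hypothetical 2-D view: rows are quarters, columns are item types.
row_labels = ["Q1", "Q2"]
col_labels = ["phone", "laptop", "tablet"]
view = [
    [10, 4, 6],  # Q1
    [12, 5, 8],  # Q2
]


def pivot(rows, cols, table):
    """Swap the row and column axes; the values themselves do not change."""
    rotated = [list(column) for column in zip(*table)]
    return cols, rows, rotated


new_rows, new_cols, rotated = pivot(row_labels, col_labels, view)
```

After the pivot, item types run down the rows and quarters run across the columns. No aggregation happens; only the presentation changes.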
OLAP Engine

• The main function of the OLAP engine is to present the user a
multidimensional view of the data warehouse and to provide tools
for OLAP operations.
• The following are the types of OLAP servers.
Relational OLAP (ROLAP) Server

• These are intermediate servers which stand between a relational back-end server and user
front-end tools.
• They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP
middleware to provide missing pieces.
• ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.
• ROLAP technology tends to have higher scalability than MOLAP technology.
• ROLAP systems work primarily from the data that resides in a relational database, where the base
data and dimension tables are stored as relational tables. This model permits the multidimensional
analysis of data.
• This technique relies on manipulating the data stored in the relational database to give the appearance
of traditional OLAP's slicing and dicing functionality. In essence, each act of slicing and
dicing is equivalent to adding a "WHERE" clause to the SQL statement.
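The point about "WHERE" clauses can be sketched with Python's built-in `sqlite3` module. For brevity the sketch uses a single hypothetical flat `sales` table; a real ROLAP server would typically generate joins over fact and dimension tables as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (city TEXT, quarter TEXT, item TEXT, rupees_sold REAL);
INSERT INTO sales VALUES
    ('Delhi', 'Q1', 'phone', 500.0), ('Delhi', 'Q2', 'phone', 600.0),
    ('Delhi', 'Q1', 'laptop', 800.0), ('Mumbai', 'Q1', 'phone', 350.0);
""")

base_query = "SELECT city, quarter, item, rupees_sold FROM sales"

# Slice: one WHERE condition on a single dimension.
slice_rows = conn.execute(
    base_query + " WHERE quarter = ? ORDER BY city, item", ("Q1",)
).fetchall()

# Dice: WHERE conditions on two dimensions.
dice_rows = conn.execute(
    base_query + " WHERE city = ? AND item = ? ORDER BY quarter", ("Delhi", "phone")
).fetchall()
```

Each further restriction the analyst applies becomes one more condition appended to the same query, so no separate cube structure is needed.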
Advantages
• Can handle large amounts of information: The data size limitation of ROLAP technology depends on the data size
limit of the underlying RDBMS. So, ROLAP itself does not restrict the amount of data.
• Can leverage functionality inherent in the relational database: An RDBMS already comes with a lot of features, so
ROLAP technologies, which work on top of the RDBMS, can leverage these functionalities.

Disadvantages
• Performance can be slow: Each ROLAP report is a SQL query (or multiple SQL queries) against the relational database, so the
query time can be long if the underlying data size is large.
• Limited by SQL functionality: ROLAP technology relies on generating SQL statements to query the relational
database, and SQL statements do not suit all needs.

Multidimensional OLAP (MOLAP) Server

• A MOLAP system is based on a native logical model that directly supports
multidimensional data and operations. Data are stored physically in
multidimensional arrays, and positional techniques are used to access them.

• One of the significant distinctions between MOLAP and ROLAP is that
data are summarized and stored in an optimized format in a
multidimensional cube, instead of in a relational database. In the MOLAP
model, data are structured into proprietary formats according to the client's reporting
requirements, with the calculations pre-generated on the cubes.
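The ideas of array storage, positional access, and pre-generated calculations can be sketched with nested Python lists. Real MOLAP servers use proprietary, optimized storage formats; the dimension members and values here are hypothetical.

```python
# Dimension members map to array positions (offsets).
cities = ["Delhi", "Mumbai"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
city_pos = {name: i for i, name in enumerate(cities)}
quarter_pos = {name: i for i, name in enumerate(quarters)}

# Hypothetical sales stored as a dense 2-D array: cube[city][quarter].
cube = [
    [100, 120, 90, 110],  # Delhi
    [80, 95, 70, 105],    # Mumbai
]

# Aggregates are computed once, when the cube is built.
total_by_city = [sum(row) for row in cube]
total_by_quarter = [sum(col) for col in zip(*cube)]


def cell(city, quarter):
    """Positional access: two index lookups, with no search or join."""
    return cube[city_pos[city]][quarter_pos[quarter]]


def city_total(city):
    """Answer an aggregate query from the pre-generated totals."""
    return total_by_city[city_pos[city]]
```

Queries are fast because every answer is either a direct array lookup or a stored aggregate. The cost is that the array reserves a cell for every combination of dimension members, whether or not it holds data.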
Advantages
• Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal
for slicing and dicing operations.

• Can perform complex calculations: All calculations have been pre-generated when the cube
is created. Hence, complex calculations are not only possible, but they return quickly.

Disadvantages
• Limited in the amount of information it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large amount of data in the
cube itself.

• Requires additional investment: Cube technology is generally proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are other
investments in human and capital resources are needed.
Hybrid OLAP (HOLAP) Server

• HOLAP incorporates the best features of MOLAP and ROLAP into a
single architecture.
• HOLAP systems store larger quantities of detailed data in
the relational tables, while the aggregations are stored in the
pre-calculated cubes.
• HOLAP can also drill through from the cube down to the relational
tables for detailed data. Microsoft SQL Server 2000 provides a
hybrid OLAP server.

Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.

2. It provides fast access at all levels of aggregation.

3. HOLAP balances the disk space requirement, as it stores only the aggregate
information on the OLAP server while the detail records remain in the relational
database. So no duplicate copy of the detail records is maintained.

Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP
and ROLAP servers.
Other Types of OLAP Servers

There are also less popular OLAP styles that one may come across every so often. Some of these less popular
types are listed below.

Web-Enabled OLAP (WOLAP) Server

WOLAP is an OLAP application that is accessible via a web browser. Unlike traditional client/server OLAP applications, WOLAP
is considered to have a three-tiered architecture consisting of three components: a client, middleware, and a database server.

Desktop OLAP (DOLAP) Server

DOLAP permits a user to download a section of the data from the database or source, and work with that dataset locally, or on their desktop.

Mobile OLAP (MOLAP) Server

Mobile OLAP enables users to access and work on OLAP data and applications remotely through the use of their mobile devices.

Spatial OLAP (SOLAP) Server

SOLAP includes the capabilities of both Geographic Information Systems (GIS) and OLAP into a single user interface. It facilitates the
management of both spatial and non-spatial data.
ROLAP | MOLAP
ROLAP stands for Relational Online Analytical Processing. | MOLAP stands for Multidimensional Online Analytical Processing.
It is usually used when the data warehouse contains relational data. | It is used when the data warehouse contains relational as well as non-relational data.
It contains an analytical server. | It contains the MDDB server.
It creates a multidimensional view of data dynamically. | It contains prefabricated data cubes.
It is very easy to implement. | It is difficult to implement.
It has a high response time. | It has a lower response time due to prefabricated cubes.
It requires less memory. | It requires a large amount of memory.
