
Module 1

Data warehousing and OLAP


Contents
• Data Warehouse basic concepts
• Data Warehouse Modeling
• Data cube and OLAP
  – Characteristics of OLAP systems
  – Multidimensional view and Data cube
  – Data Cube Implementations
  – Data Cube operations
  – Implementation of OLAP and overview on OLAP software
  – Typical OLAP Operations
What is Data Warehouse?
• A data warehouse is a centralized repository that stores large volumes
of data from multiple sources for analysis and reporting. Unlike
traditional databases, which are optimized for transactional processing,
data warehouses are specifically designed to support complex queries,
data analysis, and reporting, enabling organizations to make data-driven decisions.

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” —W. H. Inmon

• Data Warehousing – Process of constructing and using data warehouses


Data Warehouse – Subject Oriented
• Data is organized around specific subjects or areas of interest, such
as sales, finance, or customer information, rather than individual
transactions
• Focusing on the modeling and analysis of data for decision makers,
not on daily operations or transaction processing
• Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process
Data Warehouse - Integrated
• Constructed by integrating multiple, heterogeneous data sources
• relational databases, flat files, on-line transaction records

• Data cleaning and data integration techniques are applied.


• Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
• E.g., hotel price: currency, tax, whether breakfast is covered, etc.

• When data is moved into the warehouse, it is converted to the warehouse’s consistent format


Data Warehouse – Time Variant
• The time horizon for the data warehouse is significantly longer than that of
operational systems
• Operational database: current value data
• Data warehouse data: provide information from a historical perspective (e.g.,
past 5-10 years)
• Every key structure in the data warehouse
• Contains an element of time, explicitly or implicitly
• But the key of operational data may or may not contain “time element”
Data Warehouse – Non Volatile
• A physically separate store of data transformed from the operational
environment
• Operational update of data does not occur in the data warehouse
environment
• Does not require transaction processing, recovery, and concurrency
control mechanisms
• Requires only two operations in data accessing:
• initial loading of data and access of data
OLTP vs. OLAP

                     OLTP                                      OLAP
users                clerk, IT professional                    knowledge worker
function             day-to-day operations                     decision support
DB design            application-oriented                      subject-oriented
data                 current, up-to-date; detailed,            historical; summarized,
                     flat relational; isolated                 multidimensional; integrated, consolidated
usage                repetitive                                ad hoc
access               read/write; index/hash on primary key     lots of scans
unit of work         short, simple transaction                 complex query
# records accessed   tens                                      millions
# users              thousands                                 hundreds
DB size              100 MB to GB                              100 GB to TB
metric               transaction throughput                    query throughput, response time
Why a Separate Data Warehouse?
• High performance for both systems
• DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
• Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view,
consolidation
• Different functions and different data:
• missing data: Decision support requires historical data which operational DBs do
not typically maintain
• data consolidation: DS requires consolidation (aggregation, summarization) of
data from heterogeneous sources
• data quality: different sources typically use inconsistent data representations,
codes and formats which have to be reconciled
• Note: There are more and more systems which perform OLAP analysis directly on
relational databases
Operational Data Stores(ODS)
• An ODS is designed to provide a consolidated view of the enterprise’s
current operational information
• An ODS has been defined by Inmon and Imhoff (1996) as follows:

“An Operational Data Store is a subject-oriented, integrated, volatile, current-valued data store, containing only corporate detailed data”
• Subject-oriented (University- students, lecturers and courses)
• Integrated
• Volatile -> Data changes as new information refreshes the ODS
• Detailed

An ODS may be viewed as a short term memory


ODS Contd..
Typical uses of an ODS include:
• Reporting for administrative purposes (e.g., sales totals, orders filled)
• Common product and location codes
• Supporting CRM (Customer Relationship Management)
ODS Design and Implementation
Why a Separate Database?
An ODS should be kept separate from the operational databases because, from time to time, complex queries are likely to degrade the performance of the OLTP systems.

The OLTP systems have to provide a quick response to operational users, and the business cannot afford to have response time suffer when a manager is running a complex query.
Data Mart
A Data Mart is a subset of a data warehouse that is designed to focus on a
specific area or department of an organization, such as sales, finance,
marketing, or human resources.

Data marts are typically smaller in scope than a full enterprise data
warehouse (EDW) and are optimized to meet the needs of specific users or
business functions.
Data Warehouse: A Multi-Tiered Architecture
Three Data Warehouse Models
• Enterprise warehouse
• collects all of the information about subjects spanning the entire
organization
• Data Mart
• a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
• A set of views over operational databases
• Only some of the possible summary views may be materialized
Extraction, Transformation, and Loading
(ETL)
• Data extraction
• get data from multiple, heterogeneous, and external sources
• Data cleaning
• detect errors in the data and rectify them when possible
• Data transformation
• convert data from legacy or host format to warehouse format
• Load
• sort, summarize, consolidate, compute views, check integrity, and build
indices and partitions
• Refresh
• propagate the updates from the data sources to the warehouse
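
A minimal sketch of the ETL steps above, assuming a hypothetical orders.csv source file and a SQLite file standing in for the warehouse; the file name, column names, and cleaning rules are illustrative only, not a prescribed implementation.

# Hedged sketch: extract from a hypothetical flat-file source, clean/transform,
# and load into a SQLite table acting as the warehouse. Names and rules are illustrative.
import csv
import sqlite3

def extract(path):
    # Extraction: read rows from a heterogeneous flat-file source
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean_and_transform(rows):
    # Cleaning: drop rows with missing keys; transformation: normalize encodings
    out = []
    for r in rows:
        if not r.get("order_id"):
            continue                                  # rectify by discarding unusable records
        r["amount"] = float(r["amount"])              # convert legacy text format to a number
        r["country"] = r["country"].strip().upper()   # consistent encoding structure
        out.append(r)
    return out

def load(rows, db="warehouse.db"):
    # Load: build the target table, insert the rows, and build an index
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact (order_id TEXT, country TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                    [(r["order_id"], r["country"], r["amount"]) for r in rows])
    con.execute("CREATE INDEX IF NOT EXISTS idx_country ON sales_fact(country)")
    con.commit()
    con.close()

if __name__ == "__main__":
    load(clean_and_transform(extract("orders.csv")))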
Metadata Repository
• Meta data is the data defining warehouse objects. It stores:
• Description of the structure of the data warehouse
• schema, view, dimensions, hierarchies, derived data defn, data mart locations and
contents
• Operational meta-data
• data lineage (history of migrated data and transformation path), currency of data
(active, archived, or purged), monitoring information (warehouse usage statistics, error
reports, audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
• warehouse schema, view and derived data definitions
• Business data
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
• Star schema: A fact table in the middle connected to a set of dimension
tables
• Snowflake schema: A refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
• Fact constellations: Multiple fact tables share dimension tables, viewed
as a collection of stars, therefore called galaxy schema or fact
constellation
Example of Star Schema
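
A minimal sketch of a star schema in code, using pandas; the table contents, column names, and measures are hypothetical and only illustrate the layout of one fact table whose keys reference several dimension tables.

# Hedged sketch: a central fact table with foreign keys into small dimension tables.
import pandas as pd

time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})
item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["TV", "PC"], "brand": ["A", "B"]})
loc_dim  = pd.DataFrame({"loc_key": [100, 101], "city": ["Vancouver", "Toronto"], "country": ["Canada", "Canada"]})

# Fact table: foreign keys to each dimension plus the numeric measures
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2], "item_key": [10, 11, 10], "loc_key": [100, 101, 100],
    "dollars_sold": [605.0, 825.0, 680.0], "units_sold": [5, 3, 6],
})

# A star join: resolve the keys, then aggregate a measure by dimension attributes
cube = (sales_fact.merge(time_dim, on="time_key")
                  .merge(item_dim, on="item_key")
                  .merge(loc_dim, on="loc_key"))
print(cube.groupby(["quarter", "city"])["dollars_sold"].sum())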
Example of Snowflake Schema
Example of Fact Constellation(Galaxy
schema)
Data Warehouse Implementation
• Centralized
• Distributed

Steps:
• Requirement analysis and capacity planning
• Hardware integration
• Physical modeling
• Sources
• ETL
• Populate the data warehouse
• User application
• Roll-out the warehouse and application
DW Implementation Guidelines
• Build incrementally
• Need a champion
• Senior management support
• Ensure Quality
• Corporate strategy
• Business plan
• Training
• Adaptability
• Joint management
OLAP
• In 1993, E. F. Codd presented this somewhat difficult-to-understand definition of OLAP:

“OLAP is dynamic enterprise analysis required to create, manipulate, animate and synthesise information from exegetical, contemplative and formulaic data analysis models”

Exegetical – The information is manipulated from the point of view of a manager
Contemplative – From the point of view of someone who has thought about it
Formulaic – According to some formula
OLAP (Contd..)
• OLAP is software technology that enables analysts, managers and
executives to gain insight into data through fast, consistent, interactive
access to a wide variety of possible views of information that has been
transformed from raw data to reflect the real dimensionality of the
enterprise

• OLAP is fast analysis of shared multidimensional information, used for advanced analysis.

• This definition is also known as FASMI, and it implies that most OLAP queries should be answered within seconds.
CHARACTERISTICS OF OLAP SYSTEMS
• Users – select group of managers/dozens of users
• Functions – ad hoc driven and often much more complex operations.
• Nature – Involve complex queries to pull many records at a time and provide
summary/aggregate data to a manager
- OLAP apps often involve data stored in a data warehouse extracted
from many tables i.e., from more than one enterprise data base
• Design – view enterprise information as multidimensional
• Data- require historical data over several years since trends are often
important in decision making
• Kind of use – normally no data updates
FASMI Characteristics
• Fast – OLAP queries are answered very quickly (within seconds)
- Pre-compute the most commonly queried aggregates and compute
the remaining on-the-fly.
• Analytic – provide rich analytic functionality
- queries answered without any programming
• Shared – shared by hundreds of users
- should provide adequate security for confidentiality as well as
integrity
- concurrency control is required
• Multidimensional – whatever OLAP software is used, it must provide a
multidimensional conceptual view of data
FASMI Characteristics
• Information – obtain info from data warehouse
- should be able to handle large amount of input data
Codd’s OLAP Characteristics
• Codd et al.’s 1993 paper listed 12 characteristics (rules) of OLAP systems; another six were added in 1995.
• All 18 rules are available at https://www.olapreport.com/fasmi.htm

1) Multidimensional conceptual view – helps to carry out slice and dice operations
2) Accessibility (OLAP as a mediator) – between data sources (e.g., a data warehouse) and an OLAP front-end
3) Batch extraction vs interpretive – multidimensional data staging plus partial precalculation of aggregates in large multidimensional databases
Codd’s OLAP Characteristics (Contd..)
4) Multi-user support
5) Storing OLAP results – OLAP result data should be kept separate from source data
 - Read-write OLAP applications should not be implemented directly on live transaction data, particularly when the OLAP system is fed directly from the transaction system
6) Extraction of missing values – OLAP should distinguish missing values from zero values in order to compute aggregates correctly
7) Treatment of missing values – ignoring missing values
8) Uniform reporting performance – increasing the number of dimensions or
database size should not degrade the reporting performance of OLAP system
9) Generic dimensionality – each dimension should be treated as equivalent
in structure as well as operational capabilities
Codd’s OLAP Characteristics (Contd..)
10) Unlimited dimensions and aggregation levels
Motivations for using OLAP
Examples to illustrate the types of information that OLAP tools can help
in discovering

1) Understanding and improving sales


2) Understanding and reducing costs of doing business
Multi dimensional Data model
• The multidimensional data model is a key concept in data warehousing
and OLAP (Online Analytical Processing) systems, designed to organize and
present data in a way that facilitates efficient querying and reporting. It
represents data in the form of a multi-dimensional structure, often
referred to as a data cube, which allows users to perform complex queries
and analyses on large datasets, particularly for decision-making purposes.
• Key concepts:
1) Dimensions - Dimensions are perspectives or entities with respect to
which an organization wants to keep records.
Dimensions are often organized into hierarchies. For example, in a Time
dimension, data can be analyzed by year, quarter, month, and day.
Multi dimensional Data model (Contd..)
2) Facts - Facts are the numerical measures or metrics that are analyzed in
relation to the dimensions. These could include values like sales revenue,
profit, quantity sold, or any other key performance indicator (KPI).
Facts are stored in a fact table, which typically contains keys referencing
related dimensions, along with the numerical values (metrics) being
measured.

3) Data Cube - The data is stored in a structure called a data cube (even if it
may have more than three dimensions).
The cube allows for multidimensional analysis, enabling users to slice, dice,
drill down, or roll up the data for in-depth analysis.
Multi dimensional Data model (Contd..)
4) Hierarchies - Each dimension can have levels of granularity in the form of
hierarchies. For example, the Time dimension can have a hierarchy of Year →
Quarter → Month → Day. Users can analyze data at different levels of this
hierarchy (e.g., aggregate sales per month vs. sales per year).
From Tables and Spreadsheets to
Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data
cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
• Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter,
year)
• Fact table contains measures (such as dollars_sold) and keys to each of the related dimension
tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
• Lattice - The lattice of cuboids is the structure of all possible cuboids that can be generated from a
multi-dimensional cube, based on different levels of aggregation.
Cube: A Lattice of Cuboids
For example, a 4-D cube with dimensions time, item, location, and supplier gives rise to the following lattice of cuboids:

0-D (apex) cuboid:  all
1-D cuboids:        (time), (item), (location), (supplier)
2-D cuboids:        (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids:        (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid:  (time, item, location, supplier)

A Concept Hierarchy:
Dimension (location)
• A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts
Concept Hierarchies (Contd..)

(a) A hierarchy for location (total order)
(b) A lattice for time (partial order)

A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.
Data Cube Measures: Three Categories
• Distributive: if the result derived by applying the function to n aggregate values is
the same as that derived by applying the function on all the data without
partitioning
• E.g., count(), sum(), min(), max()
• Algebraic: if it can be computed by an algebraic function with M arguments (where
M is a bounded integer), each of which is obtained by applying a distributive
aggregate function
• E.g., avg(), min_N(), standard_deviation()
• Holistic: if there is no constant bound on the storage size needed to describe a
subaggregate.
• E.g., median(), mode(), rank()
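
A small sketch contrasting the three categories above on made-up, partitioned data: a distributive measure like sum() can be combined from per-partition results, an algebraic measure like avg() needs only a bounded set of distributive pieces (sum and count), while a holistic measure like median() cannot in general be derived from partial results.

# Hedged sketch: distributive vs. algebraic vs. holistic measures on partitioned data.
from statistics import median

partitions = [[4, 8, 15], [16, 23], [42]]        # hypothetical data split into partitions
all_data = [x for p in partitions for x in p]

# Distributive: the sum of partition sums equals the overall sum
assert sum(sum(p) for p in partitions) == sum(all_data)

# Algebraic: avg() is computable from two distributive values (sum, count)
total, count = sum(all_data), len(all_data)
avg = total / count

# Holistic: the overall median cannot, in general, be derived from partition medians
partial_medians = [median(p) for p in partitions]
print(avg, median(all_data), partial_medians)    # the partition medians alone do not determine the true median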
Multidimensional Data
• Sales volume as a function of product, month, and region
A Sample Data Cube
Cuboids Corresponding to the Cube
Typical OLAP Operations
1) Roll-up (Drill-up) – This is like zooming out on the data cube. It is used when the user needs further abstraction or less detail.
• Initially, the location hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.

2) Drill-down – This is like zooming in on the data. It is the reverse of roll-up, used when the user needs further detail or wants to partition the data more finely.
• This adds more detail to the data. Initially, the time hierarchy was "day < month < quarter < year".
• On drill-down, the time dimension is descended from the level of quarter to the level of month.
Typical OLAP Operations (Contd..)
3) Slice and Dice – The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
• The dice operation defines a subcube by performing a selection on two or more dimensions.

4) Pivot (rotate) – This is used when the user wishes to re-orient the view of the data cube. This may involve swapping the rows and columns, or moving one of the row dimensions into the column dimension (see the sketch below).
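
A minimal pandas sketch of roll-up, drill-down, slice, dice, and pivot on a toy sales table; the data values and column names are hypothetical.

# Hedged sketch: the OLAP operations above expressed with pandas on a toy cube.
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["TV", "PC", "TV", "PC"],
    "dollars_sold": [605, 680, 825, 714],
})

# Roll-up: climb the location hierarchy from city to country
rollup = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Drill-down: descend to a finer grain (here, city and item together)
drilldown = sales.groupby(["city", "quarter", "item"])["dollars_sold"].sum()

# Slice: select on a single dimension (quarter = "Q1")
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions
dice = sales[(sales["quarter"] == "Q1") & (sales["city"] == "Toronto")]

# Pivot: re-orient the view, rows = item, columns = city
pivot = sales.pivot_table(index="item", columns="city", values="dollars_sold", aggfunc="sum")
print(rollup, drilldown, slice_q1, dice, pivot, sep="\n\n")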
Typical OLAP Operations (Contd..)
Other operations :
5) Drill-across – executes queries involving (i.e., across) more than one fact table.

6) Drill-through – this operation makes use of relational SQL facilities to drill through the bottom level of a data cube down to its back-end relational tables.
A Star-Net Query Model
• The querying of multidimensional databases can be based on Starnet model.
• A starnet model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension.
• Each abstraction level in the hierarchy is called a footprint.
• These represent the granularities available for use by OLAP operations such as
drill-down and roll-up.
Design of Data Warehouse: A Business
Analysis Framework
• Four views regarding the design of a data warehouse
• Top-down view
• allows selection of the relevant information necessary for the data
warehouse
• Data source view
• exposes the information being captured, stored, and managed by operational
systems
• Data warehouse view
• consists of fact tables and dimension tables
• Business query view
• sees the perspectives of data in the warehouse from the view of end-user
Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
• Top-down: Starts with overall design and planning (mature)
• Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
• Waterfall: structured and systematic analysis at each step before proceeding to the next
• Spiral: rapid generation of increasingly functional systems with a short turnaround time
• Typical data warehouse design process
• Choose a business process to model, e.g., orders, invoices, etc.
• Choose the grain (atomic level of data) of the business process , e.g., individual transactions
• Choose the dimensions that will apply to each fact table record
• Choose the measure that will populate each fact table record , e.g., dollars_sold and units_sold
Data Warehouse Development: A
Recommended Approach
Data Warehouse Usage
• Three kinds of data warehouse applications
• Information processing
• supports querying, basic statistical analysis, and reporting using crosstabs,
tables, charts and graphs
• Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
• Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models, performing classification
and prediction, and presenting the mining results using visualization tools
From On-Line Analytical Processing (OLAP)
to On Line Analytical Mining (OLAM)
• Why online analytical mining (OLAM)?
• Integrating OLAP with data mining, and mining knowledge in multidimensional databases, is particularly important for the following reasons:
• High quality of data in data warehouses
• DW contains integrated, consistent, cleaned data
• Available information processing infrastructure surrounding data warehouses
• ODBC, OLE (object linking and embedding) DB, Web accessing, service
facilities, reporting and OLAP tools
• OLAP-based exploratory data analysis
• Mining with drilling, dicing, pivoting, etc.
• On-line selection of data mining functions
• Integration and swapping of multiple mining functions, algorithms, and tasks
2D Representation
• The 2-D representation shows the AllElectronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars_sold (in thousands).
3D Representation
• To view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver
• The measure displayed is dollars_sold (in thousands)
• The 3-D data are represented as a series of 2-D tables
3D Data cube
Efficient Data Cube Computation
• Data cube can be viewed as a lattice of cuboids
• The bottom-most cuboid is the base cuboid
• The top-most cuboid (apex) contains only one cell
• How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?

  T = ∏_{i=1}^{n} (L_i + 1)

  (The "+1" in each factor accounts for the virtual top level all of that dimension's concept hierarchy.)

• Materialization of data cube


• Materialize every (cuboid) (full materialization), none (no materialization), or some
(partial materialization)
• Selection of which cuboids to materialize
• Based on size, sharing, access frequency, etc.
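
A tiny numeric check of the cuboid-count formula above, assuming hypothetical hierarchy depths for time, item, and location.

# Hedged sketch: T = prod(L_i + 1) over the n dimensions, where L_i is the number
# of levels of dimension i (the +1 covers the virtual top level "all").
from math import prod

levels = {"time": 4, "item": 3, "location": 4}   # hypothetical hierarchy depths
T = prod(L + 1 for L in levels.values())
print(T)                                         # 5 * 4 * 5 = 100 cuboids for this 3-D cube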
The “Compute Cube” Operator
• Cube definition and computation in DMQL:
  define cube sales [item, city, year]: sum(sales_in_dollars)
  compute cube sales
• Transform it into an SQL-like language (with a new operator cube by, introduced by Gray et al., 1996):
  SELECT item, city, year, SUM(amount)
  FROM SALES
  CUBE BY item, city, year
• This requires computing the following group-bys (see the sketch below):
  (city, item, year),
  (city, item), (city, year), (item, year),
  (city), (item), (year),
  ()
• Total number of cuboids computed for this data cube is 2^3 = 8
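
A minimal sketch of what the compute cube operator has to do: enumerate every subset of {item, city, year} and compute the corresponding group-by. The toy table is hypothetical, and pandas groupby simply stands in for the warehouse's aggregation engine.

# Hedged sketch: compute all 2^3 = 8 group-bys of a cube on (item, city, year).
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "item": ["TV", "TV", "PC", "PC"],
    "city": ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "year": [2023, 2023, 2024, 2024],
    "amount": [605, 825, 680, 714],
})

dims = ["item", "city", "year"]
cuboids = {}
for k in range(len(dims), -1, -1):
    for group in combinations(dims, k):
        if group:                                   # k-D cuboid
            cuboids[group] = sales.groupby(list(group))["amount"].sum()
        else:                                       # apex cuboid (): grand total
            cuboids[group] = sales["amount"].sum()

print(len(cuboids))    # 8 cuboids, from the base (item, city, year) down to the apex ()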
Indexing OLAP Data
• To facilitate efficient data accessing, most data warehouse systems
support index structures and materialized views (using cuboids)
• The bitmap indexing method is popular in OLAP products because it
allows quick searching in data cubes
• The bitmap index is an alternative representation of the
record_ID(RID) list.
Indexing OLAP Data: Bitmap Index
• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Suitable for low cardinality domains
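
A small sketch of a bitmap index on a low-cardinality column; the base table and its values are hypothetical.

# Hedged sketch: build a bitmap index on the "city" column of a tiny base table.
# Each distinct value gets a bit vector whose i-th bit is set when row i has that value.
base_table = ["Vancouver", "Toronto", "Vancouver", "Chicago", "Toronto"]

bitmap = {}
for i, value in enumerate(base_table):
    bitmap.setdefault(value, 0)
    bitmap[value] |= (1 << i)                  # set bit i of this value's bit vector

# Fast bit operations answer point and combined queries without scanning the table
vancouver = bitmap["Vancouver"]                # bits 0 and 2 are set
toronto_or_chicago = bitmap["Toronto"] | bitmap["Chicago"]
print(bin(vancouver), bin(toronto_or_chicago))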
Limitations of OLAP cubes
• OLAP requires restructuring of data into a star/snowflake schema
• There is a limited number of dimensions (fields) in a single OLAP cube
• It is nearly impossible to access transactional data in the OLAP cube
• Changes to an OLAP cube require a full update of the cube – a lengthy process
Indexing OLAP Data: Join Indices
• The join indexing method gained popularity from its use in relational
database query processing
• Traditional indexing maps the value in a given column to a list of rows having
that value
• In contrast, join indexing registers the joinable rows of two relations from a
relational database
• For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations, respectively. Hence, join index records can identify joinable tuples without performing costly join operations
• Join indexing is especially useful for maintaining the relationship between a
foreign key and its matching primary keys, from the joinable relation
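
A minimal sketch of a join index between a sales fact table and a location dimension: the precomputed (fact RID, dimension RID) pairs identify joinable tuples without re-running the join. The RIDs and row contents are hypothetical.

# Hedged sketch: precompute a join index of (sales_RID, location_RID) pairs
# linking the fact table's foreign key to the dimension table's primary key.
sales_fact = {                        # RID -> (loc_key, dollars_sold)
    "T57": (100, 605.0),
    "T238": (101, 825.0),
    "T884": (100, 680.0),
}
location_dim = {                      # RID -> (loc_key, city)
    "R1": (100, "Vancouver"),
    "R2": (101, "Toronto"),
}

loc_key_to_rid = {key: rid for rid, (key, _) in location_dim.items()}
join_index = [(srid, loc_key_to_rid[key]) for srid, (key, _) in sales_fact.items()]
print(join_index)                     # e.g. [('T57', 'R1'), ('T238', 'R2'), ('T884', 'R1')]

# Later queries can use the pairs directly, e.g. all sales RIDs joinable to location R1
print([srid for srid, lrid in join_index if lrid == "R1"])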
Indexing OLAP Data: Join Indices (Contd..)
• The star schema model of data warehouses makes join indexing attractive for cross-table search, because the linkage between a fact table and its corresponding dimension tables comprises the fact table's foreign key and the dimension table's primary key.
Indexing OLAP Data: Join Indices (Contd..)
• Linkages between a sales fact table and location, item dimension
tables
Indexing OLAP Data: Join Indices (Contd..)
• Join index tables based on the linkages between the sales fact table
and the location and item dimension tables shown in figure below
Efficient Processing OLAP Queries
• The purpose of materializing cuboids and constructing OLAP index
structures is to speed up the query processing in data cubes.
• Given materialized views, query processing should proceed as follows:
1) Determine which operations should be performed on the available
cuboids:
This involves transforming any selection, projection, roll-up (group-by),
and drill-down operations specified in the query into corresponding
SQL and/or OLAP operations
For example, slicing and dicing of a data cube may correspond to
selection and/or projection operations on a materialized cuboid
Efficient Processing OLAP Queries
2) Determine to which materialized cuboid(s) the relevant operations should
be applied:
• This involves identifying all of the materialized cuboids that may potentially
be used to answer the query,
• pruning the above set using knowledge of “dominance” relationships among
the cuboids,
• estimating the costs of using the remaining materialized cuboids, and
• selecting the cuboid with the least cost.
Example: Suppose that we define a data cube for AllElectronics of the form "sales [time, item, location]: sum(sales_in_dollars)".
The dimension hierarchies used are "day < month < quarter < year" for time,
"item_name < brand < type" for item, and
"street < city < province_or_state < country" for location.
Query to be processed is on {brand, province_or_state}, with the selection constant
“year=2000” and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2000
Which of the above four cuboids should be selected to process the query?
• Cuboid 2 cannot be selected since country is a more general concept than province_or_state
• Cuboids 1, 3 and 4 can be used to process the query since
1) They have the same set or superset of the dimension in the query
2) The selection clause can imply selection in the cuboid
3) The abstraction levels for the item and location dimensions in these cuboids
are at a finer level than brand and province_or_state respectively

How would the costs of each cuboid compare if used to process the query?
• Cuboid 1 would cost the most since item_name and city are at a lower level
• If there are not many year values associated with items in the cube, but there
are several item_names for each brand, then cuboid 3 will be smaller than 4
• If efficient indices are available for cuboid 4, then cuboid 4 may be a better
choice
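
A rough code sketch of the selection step just described, under simplifying assumptions: each candidate cuboid is described only by its level per dimension and a hypothetical estimated size, and "can answer" merely checks that every cuboid level is at least as fine as the level the query asks for. A real optimizer would also account for selection predicates (such as year = 2000), available indices, and dominance pruning.

# Hedged sketch: pick the cheapest materialized cuboid that can answer a query.
# Lower index in the hierarchy list means a finer level.
LEVELS = {
    "item": ["item_name", "brand", "type"],
    "location": ["street", "city", "province_or_state", "country"],
}

def can_answer(cuboid_levels, query_levels):
    # Usable only if, on every queried dimension, the cuboid is at least as fine as the query
    return all(LEVELS[d].index(cuboid_levels[d]) <= LEVELS[d].index(query_levels[d])
               for d in query_levels)

candidates = {                                   # hypothetical estimated sizes (rows)
    "cuboid1": ({"item": "item_name", "location": "city"}, 1_000_000),
    "cuboid2": ({"item": "brand", "location": "country"}, 20_000),
    "cuboid3": ({"item": "brand", "location": "province_or_state"}, 80_000),
    "cuboid4": ({"item": "item_name", "location": "province_or_state"}, 120_000),
}
query = {"item": "brand", "location": "province_or_state"}

usable = {name: size for name, (lvls, size) in candidates.items() if can_answer(lvls, query)}
best = min(usable, key=usable.get)
print(usable, "->", best)                        # cuboid2 is filtered out; the smallest usable cuboid wins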
