Unit-1

Data Warehousing

Points to discuss

■ Data Warehouse: Basic Concepts
■ Database vs Data Warehouse
■ Data Warehouse Architecture/Components
■ Building a Data Warehouse
■ Multi-Dimensional Data Models
■ Data Warehouse Design
■ Data Warehouse Usage
■ Types of OLAP Servers

Data Warehouse: Basic Concepts

What is a Data Warehouse?

■ Defined in many different ways, but not rigorously.
  ■ A decision support database that is maintained separately from the organization’s operational database
  ■ Supports information processing by providing a solid platform of consolidated, historical data for analysis
■ “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
■ Data warehousing:
  ■ The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented

■ Organized around major subjects, such as customer, product, sales
■ Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
■ Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated

■ Constructed by integrating multiple, heterogeneous data sources
  ■ Relational databases, flat files, on-line transaction records
■ Data cleaning and data integration techniques are applied.
  ■ Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    ■ E.g., hotel price: currency, tax, breakfast covered, etc.
  ■ When data is moved to the warehouse, it is converted.

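The conversion step for the hotel-price example can be sketched as follows. This is only a minimal illustration, assuming two hypothetical source systems that record prices differently (one in USD without tax, one in EUR with tax and breakfast included). The field names, exchange rate, tax rate, and breakfast charge are made-up values, not taken from any real system.

```python
from dataclasses import dataclass

# Hypothetical constants, for illustration only.
EUR_TO_USD = 1.10      # assumed exchange rate
TAX_RATE = 0.10        # assumed tax rate
BREAKFAST_USD = 15.0   # assumed breakfast charge per night


@dataclass
class WarehousePrice:
    """Unified warehouse format: USD, tax included, breakfast excluded."""
    hotel: str
    price_usd: float


def from_source_a(record: dict) -> WarehousePrice:
    # Source A (assumed layout): "rate" is in USD and does NOT include tax.
    return WarehousePrice(record["hotel_name"],
                          round(record["rate"] * (1 + TAX_RATE), 2))


def from_source_b(record: dict) -> WarehousePrice:
    # Source B (assumed layout): "price_eur" includes tax and breakfast.
    usd = record["price_eur"] * EUR_TO_USD - BREAKFAST_USD
    return WarehousePrice(record["name"], round(usd, 2))


rows = [
    from_source_a({"hotel_name": "Grand", "rate": 100.0}),
    from_source_b({"name": "Plaza", "price_eur": 150.0}),
]
for row in rows:
    print(row)
```

After conversion, both records share one naming convention (`hotel`, `price_usd`) and one meaning of "price", so they can be compared or aggregated directly.
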
Data Warehouse: Time-Variant

■ The time horizon for the data warehouse is significantly longer than that of operational systems
  ■ Operational database: current value data
  ■ Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse
  ■ Contains an element of time, explicitly or implicitly
  ■ But the key of operational data may or may not contain a “time element”

Data Warehouse: Nonvolatile

■ A physically separate store of data transformed from the operational environment
■ Operational update of data does not occur in the data warehouse environment
  ■ Does not require transaction processing, recovery, and concurrency control mechanisms
  ■ Requires only two operations in data accessing: initial loading of data and access of data

Database (OLTP) vs Data Warehouse (OLAP)

OLTP (Database) vs. OLAP (Data Warehouse)

Why a Separate Data Warehouse?

■ High performance for both systems
  ■ DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  ■ Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
■ Different functions and different data:
  ■ Missing data: decision support (DS) requires historical data, which operational DBs do not typically maintain
  ■ Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  ■ Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled

Data Warehouse Components/Architecture

Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered data warehouse architecture with four tiers. Data Sources: operational DBs and other sources, fed through Extract, Transform, Load, and Refresh. Data Storage: data warehouse, data marts, and metadata, with a monitor and integrator. OLAP Engine: OLAP server. Front-End Tools: analysis, query, reports, and data mining.]

Building a Data Warehouse

Extraction, Transformation, and Loading (ETL)

■ Data extraction
  ■ Get data from multiple, heterogeneous, and external sources
■ Data cleaning
  ■ Detect errors in the data and rectify them when possible
■ Data transformation
  ■ Convert data from legacy or host format to warehouse format
■ Load
  ■ Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
■ Refresh
  ■ Propagate the updates from the data sources to the warehouse

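The first four ETL steps can be sketched end to end with Python's standard `sqlite3` module. This is a toy sketch under stated assumptions: the source rows, column names, and the `sales` / `sales_by_item` objects are hypothetical, and an in-memory SQLite database stands in for the warehouse. The refresh step is not shown.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from heterogeneous sources.
SOURCE_ROWS = [
    {"item": " tv ", "amount": "100.50", "region": "north"},
    {"item": "PC", "amount": "bad-value", "region": "South"},  # dirty row
    {"item": "TV", "amount": "50", "region": "NORTH"},
]


def extract():
    """Extraction: pull rows from the (here, in-memory) sources."""
    return list(SOURCE_ROWS)


def clean(rows):
    """Cleaning: detect rows whose amount cannot be parsed and drop them."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        good.append(row)
    return good


def transform(rows):
    """Transformation: convert source format to a uniform warehouse format."""
    return [(r["item"].strip().upper(), r["region"].strip().lower(),
             float(r["amount"])) for r in rows]


def load(conn, rows):
    """Load: store rows, build an index, and define a summary view."""
    conn.execute("CREATE TABLE sales (item TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_sales_item ON sales (item)")
    conn.execute("CREATE VIEW sales_by_item AS "
                 "SELECT item, SUM(amount) AS total FROM sales GROUP BY item")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
summary = conn.execute("SELECT item, total FROM sales_by_item").fetchall()
print(summary)
```

Keeping each step as its own function mirrors the slide's list and makes it easy to swap in a different cleaning rule or target store.
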
Three Data Warehouse Models

■ Enterprise warehouse
  ■ Collects all of the information about subjects spanning the entire organization
■ Data mart
  ■ A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  ■ Independent vs. dependent (sourced directly from the warehouse) data mart
■ Virtual warehouse
  ■ A set of views over operational databases
  ■ Only some of the possible summary views may be materialized

Metadata Repository

■ Metadata is the data defining warehouse objects. It stores:
  ■ Description of the structure of the data warehouse
    ■ Schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents
  ■ Operational metadata
    ■ Data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
  ■ The algorithms used for summarization
  ■ The mapping from the operational environment to the data warehouse
  ■ Data related to system performance
    ■ Warehouse schema, view, and derived data definitions
  ■ Business data

Relational database technology for data warehouse

■ Linear speed-up: refers to the ability to increase the number of processors to reduce response time
■ Linear scale-up: refers to the ability to provide the same performance on the same requests as the database size increases

Data Mining: Concepts and Techniques

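Both properties can be expressed as simple ratios of elapsed times. The sketch below uses the commonly used ratio definitions, which the slide itself does not state, so treat the formulas, function names, and the 5% tolerance as assumptions for illustration.

```python
def speedup(time_small_system: float, time_large_system: float) -> float:
    """Same workload, more processors: old elapsed time / new elapsed time."""
    return time_small_system / time_large_system


def scaleup(time_small_problem: float, time_large_problem: float) -> float:
    """Data and hardware both grown N-fold: small-problem time / large-problem time."""
    return time_small_problem / time_large_problem


def is_linear_speedup(t1: float, tn: float, n: int, tolerance: float = 0.05) -> bool:
    # Linear speed-up: N times the processors gives about N times the speed.
    return abs(speedup(t1, tn) - n) <= tolerance * n


def is_linear_scaleup(t_small: float, t_large: float, tolerance: float = 0.05) -> bool:
    # Linear scale-up: elapsed time stays about the same, so the ratio is about 1.
    return abs(scaleup(t_small, t_large) - 1.0) <= tolerance


# Hypothetical measurements: 120 s on 1 processor, 30 s on 4 processors.
print(speedup(120.0, 30.0), is_linear_speedup(120.0, 30.0, 4))
# Hypothetical measurements: 60 s before growth, 61 s after growing data and hardware 4x.
print(is_linear_scaleup(60.0, 61.0))
```
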
Types of parallelism

■ Inter-query parallelism: different server threads or processes handle multiple requests at the same time.
■ Intra-query parallelism: decomposes the serial SQL query into lower-level operations such as scan, join, and sort, which are then executed concurrently.

Two ways to do intra-query parallelism

■ Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.
■ Vertical parallelism: occurs among different tasks. All query components such as scan, join, and sort are executed in parallel in a pipelined fashion; in other words, the output of one task becomes the input of another task.

Horizontal parallelism and Vertical parallelism

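The two forms of intra-query parallelism can be contrasted in a small sketch. It is illustrative only: the partitions and the filter threshold are hypothetical, threads stand in for processors, and the generator pipeline shows the data flow of vertical parallelism (one task's output feeding the next) without running the stages on separate processors.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table, already split into three partitions (e.g., one per disk).
PARTITIONS = [
    [("a", 5), ("b", 12)],
    [("c", 7), ("d", 20)],
    [("e", 1), ("f", 15)],
]


def scan_and_filter(partition):
    """One task (scan + filter) applied to a single partition."""
    return [row for row in partition if row[1] >= 7]


def horizontal(partitions):
    """Horizontal: the same task runs concurrently on different partitions."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        parts = list(pool.map(scan_and_filter, partitions))
    return [row for part in parts for row in part]


def scan(partitions):
    """Pipeline stage 1: emit rows one at a time."""
    for partition in partitions:
        yield from partition


def select(rows, minimum):
    """Pipeline stage 2: consume scan output as it is produced."""
    for row in rows:
        if row[1] >= minimum:
            yield row


def vertical(partitions):
    """Vertical: scan -> select -> sort, each stage fed by the previous one."""
    # The final sort must see all of its input before emitting any output.
    return sorted(select(scan(partitions), 7), key=lambda row: row[1])


print(horizontal(PARTITIONS))
print(vertical(PARTITIONS))
```
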
Data partitioning

■ Data partitioning is the key component for effective parallel execution of database operations. There are two ways to do data partitioning:
  ■ Random partitioning
  ■ Intelligent partitioning

Random partitioning

■ Includes random data striping across multiple disks on a single server.
■ Another option for random partitioning is round-robin partitioning, in which each record is placed on the next disk assigned to the database.

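Round-robin placement is easy to sketch: the i-th record goes to disk `i mod N`. In the sketch below, disks are modelled as plain Python lists and the record names are hypothetical.

```python
def round_robin_partition(records, num_disks):
    """Place each record on the next disk in turn, wrapping around at the end."""
    disks = [[] for _ in range(num_disks)]
    for position, record in enumerate(records):
        disks[position % num_disks].append(record)
    return disks


disks = round_robin_partition(["r0", "r1", "r2", "r3", "r4", "r5", "r6"], 3)
print(disks)
```

The disks end up balanced to within one record, but finding a specific record later requires searching every disk, which is the cost that intelligent partitioning tries to avoid.
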
Intelligent partitioning

■ Assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks. The intelligent partitioning schemes include:
  ■ Hash partitioning
  ■ Key range partitioning
  ■ Schema partitioning
  ■ User-defined partitioning

Hash partitioning

■ A hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row.

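A minimal sketch of hash partitioning follows. It assumes CRC-32 (from the standard `zlib` module) as the hash function, purely because it gives the same result on every run; a real DBMS uses its own hash algorithm. The row layout and key values are hypothetical.

```python
import zlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a partitioning-key value to a partition number in [0, num_partitions)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_rows(rows, key_field, num_partitions):
    """Distribute rows (dicts) into partitions by hashing one field."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        target = hash_partition(str(row[key_field]), num_partitions)
        partitions[target].append(row)
    return partitions


rows = [{"cust_id": f"C{n:03d}", "amount": n * 10} for n in range(8)]
for number, part in enumerate(partition_rows(rows, "cust_id", 4)):
    print(number, [r["cust_id"] for r in part])
```

Because the partition number is computed from the key alone, a lookup for one key value only has to touch a single partition.
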
Key range partitioning

■ Rows are placed and located in the partitions according to the value of the partitioning key.
■ That is, all the rows with key values from A to K are in partition 1, L to T are in partition 2, and so on.

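The A-to-K / L-to-T example can be sketched with a sorted list of range boundaries and a binary search. The slide only gives the first two ranges, so treating everything after T as partition 3 is an assumption made for this illustration.

```python
import bisect

# Inclusive upper bound of the key's first letter for each partition:
# A-K -> partition 1, L-T -> partition 2, anything after T -> partition 3 (assumed).
UPPER_BOUNDS = ["K", "T"]


def range_partition(key: str) -> int:
    """Return the 1-based partition number for a key, based on its first letter."""
    first_letter = key[0].upper()
    return bisect.bisect_left(UPPER_BOUNDS, first_letter) + 1


for name in ["Adams", "Kim", "Lopez", "Turner", "Young"]:
    print(name, range_partition(name))
```

Unlike hash partitioning, neighbouring key values land in the same partition, which helps range queries but can leave partitions unevenly filled if the keys are skewed.
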
Schema partitioning

■ An entire table is placed on one disk; another table is placed on a different disk, and so on.
■ This is useful for small reference tables.

User-defined partitioning

■ It allows a table to be partitioned on the basis of a user-defined expression.

Database architectures for parallel processing

■ There are three DBMS software architecture styles for parallel processing:
  ■ Shared memory, or shared everything, architecture (SMA)
  ■ Shared disk architecture (SDA)
  ■ Shared nothing architecture (SNA)

Shared Memory Architecture (SMA)

■ A tightly coupled shared-memory system
■ Characteristics:
  ■ Multiple PUs share memory
  ■ Each PU has full access to all shared memory through a common bus
  ■ Communication between nodes occurs via shared memory
  ■ Performance is limited by the bandwidth of the memory bus
  ■ Simple to implement and provides a single system image when implementing an RDBMS on an SMP (symmetric multiprocessor)
■ Disadvantage:
  ■ Scalability is limited by bus bandwidth and latency, and by available memory

Shared Disk Architecture (SDA)

■ Shared disk systems are typically loosely coupled.
■ Characteristics:
  ■ Each node consists of one or more PUs and associated memory
  ■ Memory is not shared between nodes
  ■ Communication occurs over a common high-speed bus
  ■ Each node has access to the same disks and other resources
  ■ A node can be an SMP if the hardware supports it
  ■ Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system
  ■ A Distributed Lock Manager (DLM) is required

Advantages (SDA)

■ Shared disk systems permit high availability.
■ All data is accessible even if one node dies.
■ These systems have the concept of one database, which is an advantage over shared nothing systems.
■ Shared disk systems provide for incremental growth.

Disadvantages (SDA)

■ Inter-node synchronization is required, involving DLM overhead and greater dependency on the high-speed interconnect.
■ If the workload is not partitioned well, there may be high synchronization overhead.

Shared Nothing Architecture (SNA)

■ Shared nothing systems are typically loosely coupled.
■ In shared nothing systems, only one CPU is connected to a given disk, so access to a table or database located on that disk goes through that CPU.
■ Shared nothing systems are concerned with access to disks, not access to memory.
■ Adding more PUs and disks can improve scale-up.

Advantages & disadvantages of SNA

■ Advantages:
  ■ Shared nothing systems provide for incremental growth.
  ■ System growth is practically unlimited.
  ■ Massively Parallel Processing (MPP) systems are good for read-only databases and decision support applications.
  ■ Failure is local: if one node fails, the others stay up.
■ Disadvantages:
  ■ More coordination is required.
  ■ More overhead is required for a process working on a disk belonging to another node.
  ■ If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may be worthwhile to consider data-dependent routing to alleviate contention.

Multi-Dimensional Data Models

Multidimensional Data

■ Sales volume as a function of product, month, and region
■ Dimensions: Product, Location, Time
■ Hierarchical summarization paths:
  ■ Product: Industry > Category > Product
  ■ Location: Region > Country > City > Office
  ■ Time: Year > Quarter > Month or Week > Day

[Figure: data cube with axes Product, Region, and Month]

From Tables and Spreadsheets to Data Cubes

■ A data warehouse is based on a multidimensional data model, which views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  ■ Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
  ■ Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
■ In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids

■ 0-D (apex) cuboid: all
■ 1-D cuboids: time; item; location; supplier
■ 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
■ 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
■ 4-D (base) cuboid: (time, item, location, supplier)

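The lattice for the four dimensions time, item, location, and supplier can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions. The short sketch below assumes no dimension has a concept hierarchy, so there are exactly 2^n cuboids for n dimensions.

```python
from itertools import combinations


def cuboids(dimensions):
    """Yield every cuboid (subset of dimensions), from the 0-D apex to the n-D base."""
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            yield combo


dims = ["time", "item", "location", "supplier"]
lattice = list(cuboids(dims))
for cuboid in lattice:
    label = ", ".join(cuboid) if cuboid else "all"
    print(f"{len(cuboid)}-D: {label}")
print(len(lattice))  # 2 ** 4 = 16 cuboids
```
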
Conceptual Modeling of Data Warehouses

■ Modeling data warehouses: dimensions & measures
  ■ Star schema: a fact table in the middle connected to a set of dimension tables
  ■ Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  ■ Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore called a galaxy schema or fact constellation

Example of Star Schema

■ Sales Fact Table
  ■ Keys: time_key, item_key, branch_key, location_key
  ■ Measures: units_sold, dollars_sold, avg_sales
■ Dimension tables
  ■ time: time_key, day, day_of_the_week, month, quarter, year
  ■ item: item_key, item_name, brand, type, supplier_type
  ■ branch: branch_key, branch_name, branch_type
  ■ location: location_key, street, city, state_or_province, country

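A cut-down version of this star schema can be built and queried with Python's standard `sqlite3` module. This is a sketch under stated assumptions: only three of the dimension tables and a few of their columns are included, the table names carry a hypothetical `dim_` / `fact_` prefix, and all rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Fact table: one foreign key per dimension, plus the measures.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);

INSERT INTO dim_time VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO dim_item VALUES (1, 'TV', 'BrandA'), (2, 'PC', 'BrandB');
INSERT INTO dim_location VALUES (1, 'Toronto', 'Canada'), (2, 'Vancouver', 'Canada');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 2, 1000.0),
    (1, 2, 2, 1,  800.0),
    (2, 1, 2, 3, 1500.0);
""")

# A typical star-join query: join the fact table to the dimensions it needs,
# then aggregate a measure by dimension attributes.
query = """
SELECT i.item_name, t.quarter, SUM(s.dollars_sold)
FROM fact_sales AS s
JOIN dim_item AS i ON i.item_key = s.item_key
JOIN dim_time AS t ON t.time_key = s.time_key
GROUP BY i.item_name, t.quarter
ORDER BY i.item_name, t.quarter
"""
result = conn.execute(query).fetchall()
print(result)
```

In a snowflake schema the same query would need extra joins, for example from location to a separate city table.
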
Example of Snowflake Schema

■ Sales Fact Table
  ■ Keys: time_key, item_key, branch_key, location_key
  ■ Measures: units_sold, dollars_sold, avg_sales
■ Dimension tables
  ■ time: time_key, day, day_of_the_week, month, quarter, year
  ■ item: item_key, item_name, brand, type, supplier_key
    ■ supplier: supplier_key, supplier_type
  ■ branch: branch_key, branch_name, branch_type
  ■ location: location_key, street, city_key
    ■ city: city_key, city, state_or_province, country

Example of Fact Constellation

■ Sales Fact Table
  ■ Keys: time_key, item_key, branch_key, location_key
  ■ Measures: units_sold, dollars_sold, avg_sales
■ Shipping Fact Table
  ■ Keys: time_key, item_key, shipper_key, from_location, to_location
  ■ Measures: dollars_cost, units_shipped
■ Dimension tables
  ■ time: time_key, day, day_of_the_week, month, quarter, year
  ■ item: item_key, item_name, brand, type, supplier_type
  ■ branch: branch_key, branch_name, branch_type
  ■ location: location_key, street, city, province_or_state, country
  ■ shipper: shipper_key, shipper_name, location_key, shipper_type

A Concept Hierarchy: Dimension (location)

■ all: all
■ region: Europe, ..., North_America
■ country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
■ city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
■ office: L. Chan, ..., M. Wind

Data Cube Measures: Three Categories

■ Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
  ■ E.g., count(), sum(), min(), max()
■ Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  ■ E.g., avg(), min_N(), standard_deviation()
■ Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
  ■ E.g., median(), mode(), rank()

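The three categories can be checked on a small partitioned data set. The numbers and the three-way partitioning below are arbitrary sample values chosen for illustration.

```python
from statistics import median

# Hypothetical data, split into three partitions.
partitions = [[4, 8], [15, 16], [23, 42, 7]]
all_values = [value for part in partitions for value in part]

# Distributive: combining per-partition aggregates gives the global result.
total = sum(sum(part) for part in partitions)
largest = max(max(part) for part in partitions)
count = sum(len(part) for part in partitions)  # partial counts are combined by summing

# Algebraic: avg() is computed from a bounded number (M = 2) of
# distributive results, namely sum() and count().
average = total / count

# Holistic: median() has no such bounded summary. The median of the
# per-partition medians is generally not the true median.
true_median = median(all_values)
median_of_medians = median(median(part) for part in partitions)

print(total, largest, count, average)
print(true_median, median_of_medians)
```

This is why distributive and algebraic measures are convenient for pre-computed cuboids: a higher-level cell can be derived from lower-level cells without going back to the base data.
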
A Sample Data Cube

[Figure: 3-D sales cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). The highlighted cell is the total annual sales of TVs in U.S.A.]

Cuboids Corresponding to the Cube

■ 0-D (apex) cuboid: all
■ 1-D cuboids: product; date; country
■ 2-D cuboids: (product, date); (product, country); (date, country)
■ 3-D (base) cuboid: (product, date, country)

Typical OLAP Operations

■ Roll up (drill-up): summarize data
  ■ By climbing up a hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
  ■ From higher-level summary to lower-level summary or detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
  ■ Reorient the cube for visualization, e.g., 3-D to a series of 2-D planes
■ Other operations
  ■ Drill across: involving (across) more than one fact table

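The basic operations can be sketched over a tiny in-memory fact list. This is an illustrative sketch only: the rows, the three dimension names, and the helper functions are hypothetical, and a real OLAP server would work against cuboids rather than Python lists.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, country, quarter, dollars_sold).
FACTS = [
    ("TV", "U.S.A.", "Q1", 100), ("TV", "U.S.A.", "Q2", 120),
    ("TV", "Canada", "Q1", 40), ("PC", "U.S.A.", "Q1", 200),
    ("PC", "Canada", "Q2", 80), ("VCR", "Mexico", "Q1", 30),
]
DIMS = ("product", "country", "quarter")


def roll_up(facts, keep):
    """Roll up by dimension reduction: sum the measure over the dropped dimensions."""
    indexes = [DIMS.index(dim) for dim in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in indexes)] += row[3]
    return dict(totals)


def slice_(facts, dim, value):
    """Slice: select on a single dimension."""
    index = DIMS.index(dim)
    return [row for row in facts if row[index] == value]


def dice(facts, criteria):
    """Dice: select on several dimensions (criteria maps dimension -> allowed values)."""
    return [row for row in facts
            if all(row[DIMS.index(dim)] in allowed
                   for dim, allowed in criteria.items())]


def pivot(table_2d):
    """Pivot: swap the two axes of a 2-D view keyed by (row, column)."""
    return {(col, row): value for (row, col), value in table_2d.items()}


by_product_country = roll_up(FACTS, ("product", "country"))
by_product = roll_up(FACTS, ("product",))  # rolled up one step further
print(by_product_country)
print(by_product)
print(slice_(FACTS, "quarter", "Q1"))
print(dice(FACTS, {"product": {"TV", "PC"}, "country": {"U.S.A."}}))
print(pivot(by_product_country))
```

Drill-down is the reverse direction: going from `by_product` back to `by_product_country` reintroduces the country dimension.
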
Data Warehouse Design

Design of Data Warehouse: A Business Analysis Framework

■ Four views regarding the design of a data warehouse
  ■ Top-down view
    ■ Allows selection of the relevant information necessary for the data warehouse
  ■ Data source view
    ■ Exposes the information being captured, stored, and managed by operational systems
  ■ Data warehouse view
    ■ Consists of fact tables and dimension tables
  ■ Business query view
    ■ Sees the perspectives of data in the warehouse from the view of the end user

Data Warehouse Design Process

■ Top-down, bottom-up approaches or a combination of both
  ■ Top-down: starts with overall design and planning (mature)
  ■ Bottom-up: starts with experiments and prototypes (rapid)
■ From a software engineering point of view
  ■ Waterfall: structured and systematic analysis at each step before proceeding to the next
  ■ Spiral: rapid generation of increasingly functional systems, with short turnaround time
■ Typical data warehouse design process
  ■ Choose a business process to model, e.g., orders, invoices, etc.
  ■ Choose the grain (atomic level of data) of the business process

Data Warehouse Development: A Recommended Approach

[Figure: recommended development approach. Define a high-level corporate data model; then, with model refinement, build data marts and the enterprise data warehouse, leading to distributed data marts and a multi-tier data warehouse.]

Data Warehouse Usage

■ Three kinds of data warehouse applications
  ■ Information processing
    ■ Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs
  ■ Analytical processing
    ■ Multidimensional analysis of data warehouse data
    ■ Supports basic OLAP operations: slice-dice, drilling, pivoting
  ■ Data mining
    ■ Knowledge discovery from hidden patterns
    ■ Supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

Types of OLAP Servers

OLAP Server Architectures

■ Relational OLAP (ROLAP)
  ■ Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  ■ Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  ■ Greater scalability
■ Multidimensional OLAP (MOLAP)
  ■ Sparse array-based multidimensional storage engine
  ■ Fast indexing to pre-computed summarized data
■ Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
  ■ Flexibility, e.g., low level: relational, high level: array
■ Specialized SQL servers (e.g., Redbricks)
  ■ Specialized support for SQL queries over star/snowflake schemas

Concept Description vs. Cube-Based OLAP

■ Similarity:
  ■ Data generalization
  ■ Presentation of data summarization at multiple levels of abstraction
  ■ Interactive drilling, pivoting, slicing, and dicing
■ Differences:
  ■ OLAP has systematic preprocessing, is query independent, and can drill down to a rather low level
  ■ AOI (attribute-oriented induction) has automated desired level allocation, and may perform dimension relevance analysis/ranking when there are many relevant dimensions
  ■ AOI works on data that are not in relational form
