Module 1 (2)
OLTP systems have to provide a quick response to operational users, and the
business cannot afford to have response times suffer while a manager is
running a complex query.
Data Mart
A Data Mart is a subset of a data warehouse that is designed to focus on a
specific area or department of an organization, such as sales, finance,
marketing, or human resources.
Data marts are typically smaller in scope than a full enterprise data
warehouse (EDW) and are optimized to meet the needs of specific users or
business functions.
Data Warehouse: A Multi-Tiered Architecture
Three Data Warehouse Models
• Enterprise warehouse
• collects all of the information about subjects spanning the entire
organization
• Data Mart
• a subset of corporate-wide data that is of value to a specific group of
users. Its scope is confined to specific, selected groups, such as a marketing
data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
• A set of views over operational databases
• Only some of the possible summary views may be materialized
Extraction, Transformation, and Loading (ETL)
• Data extraction
• get data from multiple, heterogeneous, and external sources
• Data cleaning
• detect errors in the data and rectify them when possible
• Data transformation
• convert data from legacy or host format to warehouse format
• Load
• sort, summarize, consolidate, compute views, check integrity, and build
indices and partitions
• Refresh
• propagate the updates from the data sources to the warehouse
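The extraction, cleaning, transformation, and load steps above can be sketched as a small pipeline. This is only an illustration: the two source systems, their field names, and the sample rows are hypothetical, and the refresh step is left out.

```python
# Hypothetical source rows from two heterogeneous systems (illustrative only).
crm_rows = [{"cust": " Alice ", "amount": "120.50"}, {"cust": "Bob", "amount": "n/a"}]
erp_rows = [("alice", 30.0), ("carol", 75.25)]

def extract():
    """Extraction: pull rows from both sources into one common record shape."""
    for row in crm_rows:
        yield {"customer": row["cust"], "amount": row["amount"]}
    for name, amount in erp_rows:
        yield {"customer": name, "amount": amount}

def clean(records):
    """Cleaning: drop records whose amount cannot be parsed as a number."""
    for rec in records:
        try:
            rec = dict(rec, amount=float(rec["amount"]))
        except ValueError:
            continue  # an error that cannot be rectified: skip the record
        yield rec

def transform(records):
    """Transformation: normalize customer names to the (assumed) warehouse format."""
    for rec in records:
        yield dict(rec, customer=rec["customer"].strip().lower())

def load(records):
    """Load: consolidate and summarize amounts per customer, sorted by key."""
    warehouse = {}
    for rec in records:
        warehouse[rec["customer"]] = warehouse.get(rec["customer"], 0.0) + rec["amount"]
    return dict(sorted(warehouse.items()))

warehouse = load(transform(clean(extract())))
print(warehouse)
```

Keeping each stage a separate generator makes it easy to test or replace one step without touching the others.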
Metadata Repository
• Metadata is the data defining warehouse objects. It stores:
• Description of the structure of the data warehouse
• schema, view, dimensions, hierarchies, derived data definitions, data mart locations and
contents
• Operational metadata
• data lineage (history of migrated data and transformation path), currency of data
(active, archived, or purged), monitoring information (warehouse usage statistics, error
reports, audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
• warehouse schema, view and derived data definitions
• Business data
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
• Star schema: A fact table in the middle connected to a set of dimension
tables
• Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
• Fact constellations: Multiple fact tables share dimension tables, viewed
as a collection of stars, therefore called galaxy schema or fact
constellation
Example of Star Schema
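A minimal star schema sketch using Python's built-in sqlite3 module: one fact table in the middle, joined to two dimension tables through foreign keys. The table names, column names, and rows are assumptions made for illustration, not the schema from the original figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (hypothetical names and columns)
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    -- Fact table: one measure plus a key to each dimension table
    CREATE TABLE fact_sales (
        item_key INTEGER REFERENCES dim_item(item_key),
        time_key INTEGER REFERENCES dim_time(time_key),
        dollars_sold REAL
    );
    INSERT INTO dim_item VALUES (1, 'TV', 'Acme'), (2, 'Radio', 'Acme');
    INSERT INTO dim_time VALUES (1, 'Jan', 2000), (2, 'Feb', 2000);
    INSERT INTO fact_sales VALUES (1, 1, 500.0), (1, 2, 300.0), (2, 1, 50.0);
""")

# A typical star join: fact table joined to its dimensions, then aggregated.
rows = conn.execute("""
    SELECT i.item_name, t.year, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_item i ON i.item_key = f.item_key
    JOIN dim_time t ON t.time_key = f.time_key
    GROUP BY i.item_name, t.year
    ORDER BY i.item_name
""").fetchall()
print(rows)
```

A snowflake variant would split `brand` out of `dim_item` into its own table referenced by a key.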
Example of Snowflake Schema
Example of Fact Constellation (Galaxy schema)
Data Warehouse Implementation
• Centralized
• Distributed
Steps:
• Requirement analysis and capacity planning
• Hardware integration
• Physical modeling
• Sources
• ETL
• Populate the data warehouse
• User application
• Roll-out the warehouse and application
DW Implementation Guidelines
• Build incrementally
• Need a champion
• Senior management support
• Ensure Quality
• Corporate strategy
• Business plan
• Training
• Adaptability
• Joint management
OLAP
• In 1993, E. F. Codd presented this somewhat difficult-to-understand
definition of OLAP:
3) Data Cube - The data is stored in a structure called a data cube (even if it
may have more than three dimensions).
The cube allows for multidimensional analysis, enabling users to slice, dice,
drill down, or roll up the data for in-depth analysis.
Multidimensional Data Model (Contd..)
4) Hierarchies - Each dimension can have levels of granularity in the form of
hierarchies. For example, the Time dimension can have a hierarchy of Year →
Quarter → Month → Day. Users can analyze data at different levels of this
hierarchy (e.g., aggregate sales per month vs. sales per year).
From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data
cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
• Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter,
year)
• Fact table contains measures (such as dollars_sold) and keys to each of the related dimension
tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid,
which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms
a data cube.
• Lattice - The lattice of cuboids is the structure of all possible cuboids that can be generated from a
multi-dimensional cube, based on different levels of aggregation.
Cube: A Lattice of Cuboids
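[Figure: lattice of cuboids over the dimensions time, item, location, and supplier, from the 0-D (apex) cuboid "all" through the 2-D cuboids (e.g., time,supplier; item,supplier) and the 3-D cuboids (e.g., time,item,supplier; time,location,supplier; item,location,supplier)]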
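The lattice can be enumerated by listing every subset of the dimensions, grouped by how many dimensions each subset keeps. The sketch below assumes the four dimensions time, item, location, and supplier and ignores concept hierarchies within a dimension; the function name is illustrative.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of the dimensions by its dimensionality (0-D .. n-D)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D:", cuboids)
```

The 0-D entry is the apex cuboid (the empty tuple, i.e. "all"), and the 4-D entry is the base cuboid.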
2) Drill-down - This is like zooming in on the data. It is the reverse of roll-up.
• It is used when the user needs further details or wants to partition the data more
finely
• This adds more detail to the data
• Initially, the time hierarchy was “day < month < quarter < year”
• On drill-down, the time dimension is descended from the level of quarter to the
level of month
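A small sketch of drill-down along the time hierarchy: the same facts are aggregated once at the quarter level and once at the finer month level. The fact rows and the `aggregate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (month, quarter, dollars_sold); values are illustrative.
facts = [("Jan", "Q1", 100), ("Feb", "Q1", 150), ("Mar", "Q1", 50), ("Apr", "Q2", 200)]

def aggregate(facts, level):
    """Sum dollars_sold at the chosen level of the time hierarchy."""
    index = {"month": 0, "quarter": 1}[level]
    totals = defaultdict(int)
    for row in facts:
        totals[row[index]] += row[2]
    return dict(totals)

by_quarter = aggregate(facts, "quarter")  # coarser view
by_month = aggregate(facts, "month")      # drill-down: quarter -> month
print(by_quarter)
print(by_month)
```

Going from `by_month` back to `by_quarter` is the corresponding roll-up.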
Typical OLAP Operations (Contd..)
3) Slice and Dice – The slice operation performs a selection on one dimension
of the given cube, resulting in a subcube. The dice operation defines a subcube
by performing a selection on two or more dimensions.
4) Pivot (rotate) - This is used when the user wishes to re-orient the view of
the data cube.
This may involve swapping the rows and columns, or moving one of the
row dimensions into the column dimension.
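Slice, dice, and pivot can be sketched on a tiny cube stored as a dictionary keyed by (item, city, year). The cell values and helper names are hypothetical; real OLAP servers implement these operations very differently.

```python
# Hypothetical cube cells keyed by (item, city, year); values are illustrative.
cube = {
    ("TV", "Delhi", 2000): 10, ("TV", "Mumbai", 2000): 20,
    ("TV", "Delhi", 2001): 30, ("Radio", "Delhi", 2000): 5,
}
DIMS = ("item", "city", "year")

def dice(cube, **allowed):
    """Keep cells whose coordinates fall in the allowed set for each named dimension."""
    return {
        key: value for key, value in cube.items()
        if all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    }

def slice_(cube, dim, value):
    """Slice: a selection on exactly one dimension."""
    return dice(cube, **{dim: {value}})

def pivot(cells, row_dim, col_dim):
    """Arrange cells as table[row][col], summing over the remaining dimension."""
    r, c = DIMS.index(row_dim), DIMS.index(col_dim)
    table = {}
    for key, value in cells.items():
        row = table.setdefault(key[r], {})
        row[key[c]] = row.get(key[c], 0) + value
    return table

year_2000 = slice_(cube, "year", 2000)
tv_delhi = dice(cube, item={"TV"}, city={"Delhi"})
by_city = pivot(year_2000, "city", "item")  # swap the two arguments to rotate the view
print(by_city)
```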
Typical OLAP Operations (Contd..)
Other operations :
5) Drill-across – executes queries involving (i.e., across) more than one fact table.
FROM SALES
CUBE BY item, city, year
• Need to compute the following group-bys for the dimensions (city, item, year):
(city, item, year),
(city, item), (city, year), (item, year),
(city), (item), (year),
()
• The same pattern applies to any three dimensions, e.g. (date, product, customer):
(date, product, customer),
(date, product), (date, customer), (product, customer),
(date), (product), (customer),
()
• Total number of cuboids computed for this data cube is 2^3 = 8
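The group-bys listed above can be computed naively by aggregating the base rows once per subset of dimensions. This sketch assumes the measure is a sum over an `amount` column and uses made-up SALES rows; it illustrates what CUBE BY produces, not how an engine computes it efficiently.

```python
from itertools import combinations

# Hypothetical SALES rows: (item, city, year, amount); values are illustrative.
sales = [
    ("TV", "Delhi", 2000, 100), ("TV", "Mumbai", 2000, 200),
    ("Radio", "Delhi", 2001, 50),
]
DIMS = ("item", "city", "year")

def compute_cube(rows):
    """Return {group_by_dims: {group_key: summed amount}} for all 2^n group-bys."""
    cube = {}
    for k in range(len(DIMS) + 1):
        for dims in combinations(range(len(DIMS)), k):
            totals = {}
            for row in rows:
                key = tuple(row[i] for i in dims)
                totals[key] = totals.get(key, 0) + row[-1]
            cube[tuple(DIMS[i] for i in dims)] = totals
    return cube

cube = compute_cube(sales)
print(len(cube))         # number of cuboids
print(cube[()])          # apex cuboid: grand total
print(cube[("city",)])   # 1-D cuboid on city
```

Each cuboid here rescans the base rows; practical methods reuse finer cuboids to compute coarser ones.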
Indexing OLAP Data
• To facilitate efficient data accessing, most data warehouse systems
support index structures and materialized views (using cuboids)
• The bitmap indexing method is popular in OLAP products because it
allows quick searching in data cubes
• The bitmap index is an alternative representation of the
record_ID (RID) list.
Indexing OLAP Data: Bitmap Index
• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Suitable for low cardinality domains
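A minimal bitmap index sketch in which each bit vector is a Python integer, so that combining predicates is a single bitwise operation. The column contents and function names are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose i-th bit is set if row i holds it."""
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index

def rows_of(bitmap):
    """Decode a bit vector back into the list of matching row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical low-cardinality columns of a base table (illustrative values).
city = ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"]
item = ["TV", "TV", "Radio", "TV", "Radio"]
city_idx, item_idx = build_bitmap_index(city), build_bitmap_index(item)

# city = 'Delhi' AND item = 'TV' becomes a single bitwise AND.
match = rows_of(city_idx["Delhi"] & item_idx["TV"])
print(match)
```

With many distinct values the index needs one bit vector per value, which is why the method suits low-cardinality domains.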
Limitations of OLAP cubes
• OLAP requires restructuring of data into a star/snowflake schema
• There is a limit on the number of dimensions (fields) in a single OLAP cube
• It is nearly impossible to access transactional data in the OLAP cube
• Changes to an OLAP cube require a full update of the cube, which is a
lengthy process
Indexing OLAP Data: Join Indices
• The join indexing method gained popularity from its use in relational
database query processing
• Traditional indexing maps the value in a given column to a list of rows having
that value
• In contrast, join indexing registers the joinable rows of two relations from a
relational database
• For example, if two relations R(RID, A) and S(B, SID) join on the attributes A
and B, then the join index record contains the pair (RID, SID), where RID and
SID are record identifiers from the R and S relations respectively. Hence, the
join index records can identify joinable tuples without performing costly join
operations
• Join indexing is especially useful for maintaining the relationship between a
foreign key and its matching primary keys, from the joinable relation
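A join index for R(RID, A) and S(B, SID) can be sketched as a precomputed list of (RID, SID) pairs. The rows below are made up, and the helper name is an assumption.

```python
# Hypothetical relations R(RID, A) and S(B, SID); the rows are illustrative.
R = [("r1", 10), ("r2", 20), ("r3", 10)]
S = [(10, "s1"), (30, "s2"), (10, "s3")]

def build_join_index(R, S):
    """Precompute the (RID, SID) pairs whose A and B values match."""
    sids_by_b = {}
    for b, sid in S:
        sids_by_b.setdefault(b, []).append(sid)
    return [(rid, sid) for rid, a in R for sid in sids_by_b.get(a, [])]

join_index = build_join_index(R, S)
print(join_index)
```

Once built, the pairs can be looked up directly at query time instead of re-running the join.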
Indexing OLAP Data: Join Indices (Contd..)
• The star schema model of data warehouses makes join indexing attractive for
cross-table search, because the linkage between a fact table and its
corresponding dimension tables comprises the fact table’s foreign key and
the dimension table’s primary key.
Indexing OLAP Data: Join Indices (Contd..)
• Linkages between a sales fact table and location, item dimension
tables
Indexing OLAP Data: Join Indices (Contd..)
• Join index tables based on the linkages between the sales fact table
and the location and item dimension tables shown in figure below
Efficient Processing OLAP Queries
• The purpose of materializing cuboids and constructing OLAP index
structures is to speed up the query processing in data cubes.
• Given materialized views, query processing should proceed as follows:
1) Determine which operations should be performed on the available
cuboids:
This involves transforming any selection, projection, roll-up (group-by),
and drill-down operations specified in the query into corresponding
SQL and/or OLAP operations
For example, slicing and dicing of a data cube may correspond to
selection and/or projection operations on a materialized cuboid
Efficient Processing OLAP Queries (Contd..)
2) Determine to which materialized cuboid(s) the relevant operations should
be applied:
• This involves identifying all of the materialized cuboids that may potentially
be used to answer the query,
• pruning the above set using knowledge of “dominance” relationships among
the cuboids,
• estimating the costs of using the remaining materialized cuboids, and
• selecting the cuboid with the least cost.
Example: Suppose that we define a data cube for AllElectronics of the
form “sales [time, item, location]: sum(sales_in_dollars)”.
The dimension hierarchies used are “day < month < quarter < year” for time,
“item_name < brand < type” for item, and
“street < city < province_or_state < country” for location.
The query to be processed is on {brand, province_or_state}, with the selection constant
“year = 2000”, and four materialized cuboids are available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2000
Which of the above four cuboids should be selected to process the query?
• Cuboid 2 cannot be selected since country is a more general concept than
province_or_state
• Cuboids 1, 3 and 4 can be used to process the query since
1) They have the same set or a superset of the dimensions in the query
2) The selection clause in the query can imply the selection in the cuboid
3) The abstraction levels for the item and location dimensions in these cuboids
are at the same or a finer level than brand and province_or_state respectively
How would the costs of each cuboid compare if used to process the query?
• Cuboid 1 would cost the most since both item_name and city are at a lower level
than brand and province_or_state
• If there are not many year values associated with items in the cube, but there
are several item_names for each brand, then cuboid 3 will be smaller than 4
• If efficient indices are available for cuboid 4, then cuboid 4 may be a better
choice
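The usability test applied to the four cuboids can be sketched as a level comparison within each concept hierarchy. The function names and the `preselected` flag (marking a cuboid already restricted to the query's selection, as cuboid 4 is) are assumptions for illustration; cost estimation is not modeled.

```python
# Concept hierarchies from the example, listed from finest to coarsest level.
HIERARCHIES = {
    "time": ["day", "month", "quarter", "year"],
    "item": ["item_name", "brand", "type"],
    "location": ["street", "city", "province_or_state", "country"],
}

def level_of(attr):
    """Return (dimension, depth) for an attribute; depth 0 is the finest level."""
    for dim, levels in HIERARCHIES.items():
        if attr in levels:
            return dim, levels.index(attr)
    raise KeyError(attr)

def can_answer(cuboid, query_attrs, selection_attr, preselected=False):
    """A cuboid is usable if every query attribute can be rolled up from it and
    the selection is either already applied or can be evaluated on it."""
    have = dict(level_of(a) for a in cuboid)
    needed = list(query_attrs) + ([] if preselected else [selection_attr])
    for attr in needed:
        dim, depth = level_of(attr)
        if dim not in have or have[dim] > depth:
            return False  # dimension missing, or stored at a coarser level
    return True

cuboids = {
    1: (["year", "item_name", "city"], False),
    2: (["year", "brand", "country"], False),
    3: (["year", "brand", "province_or_state"], False),
    4: (["item_name", "province_or_state"], True),  # already restricted to year = 2000
}
usable = [n for n, (attrs, pre) in cuboids.items()
          if can_answer(attrs, ["brand", "province_or_state"], "year", pre)]
print(usable)
```

Choosing among the usable cuboids would then depend on their estimated sizes and available indices, as discussed above.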