Data Mining 4
Data warehouses-Overview
■ Data warehouses generalize and consolidate data in multidimensional
space.
■ The construction of data warehouses involves data cleaning, data
integration, and data transformation.
■ It can be viewed as an important preprocessing step for data mining.
■ It provides online analytical processing (OLAP) tools for the interactive
analysis of multidimensional data of varied granularities.
■ OLAP tools facilitate effective data generalization and data mining.
■ Many other DM functions, such as association, classification, prediction,
and clustering, can be integrated with OLAP operations to enhance
interactive mining of knowledge at multiple levels of abstraction.
■ Hence, the data warehouse has become an increasingly important platform for
data analysis and OLAP and will provide an effective platform for data mining.
■ Data warehousing and OLAP thus form an essential step in the knowledge
discovery process.
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
■ Constructed by integrating multiple, heterogeneous data
sources: relational databases, flat files, online transaction
records
■ Data cleaning and data integration techniques are
applied
■ To ensure consistency in naming conventions,
encoding structures, attribute measures, etc. among
different data sources
■ E.g., Hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is
converted.
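As a sketch of the kind of conversion applied when data moves into the warehouse, the following hypothetical example reconciles the hotel-price encodings mentioned above. The field names, the fixed exchange rate, and the source formats are all assumptions made for illustration:

```python
# A minimal sketch of warehouse-side conversion: reconciling inconsistent
# currencies, tax handling, and breakfast encodings from two assumed sources.

EUR_TO_USD = 1.1  # assumed fixed rate for the example

def normalize_hotel_record(record, source):
    """Convert a source-specific hotel-price record to one warehouse format."""
    if source == "eu_bookings":        # prices in EUR, tax included
        price_usd = record["price_eur"] * EUR_TO_USD
        breakfast = record["frühstück"] == "ja"
    elif source == "us_bookings":      # prices in USD, tax quoted separately
        price_usd = record["price"] + record["tax"]
        breakfast = record["breakfast_included"]
    else:
        raise ValueError(f"unknown source: {source}")
    return {"price_usd": round(price_usd, 2), "breakfast": breakfast}

print(normalize_hotel_record({"price_eur": 100.0, "frühstück": "ja"}, "eu_bookings"))
# {'price_usd': 110.0, 'breakfast': True}
```

Whatever the concrete rules, the point is that every record is mapped into a single consistent representation before it enters the warehouse.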
Data Warehouse—Time Variant
Data Warehouse—Nonvolatile
■ A DW is a physically separate store of data transformed
from the operational environment.
■ Due to this separation, operational update of data does
not occur in the data warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data
OLTP vs. OLAP
Why a Separate Data Warehouse?
■ High performance for both systems
■ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
■ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
■ Different functions and different data:
■ missing data: Decision support requires historical data which
operational DBs do not typically maintain
■ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
* Data Mining: Concepts and Techniques 11
Data Warehouse: A Multi-Tiered Architecture
[Figure: a three-tier architecture. Bottom tier: operational DBs and other
sources are extracted, transformed, loaded, and refreshed, under a
monitor/integrator and a metadata repository, into the data warehouse and
data marts. Middle tier: OLAP server. Top tier: front-end tools for analysis,
query/reports, and data mining.]
Extraction, Transformation, and Loading (ETL)
■ Data warehouse systems use back-end tools and utilities to populate
and refresh their data (Figure 4.1).
■ These tools and utilities include the following functions:
■ Data extraction
■ get data from multiple, heterogeneous, and external sources
■ Data cleaning
■ detect errors in the data and rectify them when possible
■ Data transformation
■ convert data from legacy or host format to warehouse format
■ Load
■ sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
■ Refresh
■ propagate the updates from the data sources to the warehouse
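The functions above can be strung together as a toy in-memory pipeline. The source formats, the cleaning rule, and the dict-based "warehouse" are assumptions for illustration only:

```python
# A toy end-to-end ETL sketch over in-memory data.

def extract(sources):
    """Pull raw rows from multiple heterogeneous sources (here: lists of CSV lines)."""
    for src in sources:
        yield from src

def clean(rows):
    """Detect and drop malformed rows (a real cleaner would try to rectify them)."""
    for row in rows:
        parts = row.split(",")
        if len(parts) == 2 and parts[1].strip().isdigit():
            yield parts[0].strip(), int(parts[1])

def transform(rows):
    """Convert to the warehouse format: (item, units) with upper-cased item codes."""
    for item, units in rows:
        yield item.upper(), units

def load(rows):
    """Summarize and consolidate into the warehouse table (a dict keyed by item)."""
    warehouse = {}
    for item, units in rows:
        warehouse[item] = warehouse.get(item, 0) + units
    return warehouse

legacy_src = ["tv, 5", "pc, 3", "bad record"]
host_src = ["TV, 2"]
print(load(transform(clean(extract([legacy_src, host_src])))))
# {'TV': 7, 'PC': 3}
```

A refresh step would rerun this pipeline over only the changed source rows and merge the result into the existing warehouse table.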
Metadata Repository
■ Metadata is the data defining warehouse objects. It stores:
■ Description of the structure of the data warehouse
■ schema, view, dimensions, hierarchies, derived data defn, data
mart locations and contents
■ Operational metadata
■ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
■ The algorithms used for summarization
■ The mapping from operational environment to the data warehouse
■ Data related to system performance
■ warehouse schema, view and derived data definitions
■ Business data
■ business terms and definitions, ownership of data, charging policies
4.2 Data Warehouse Modeling: Data Cube
and OLAP
■ Data warehouses and OLAP tools are based on a multidimensional
data model, which views data in the form of a data cube.
■ A data cube allows data to be modeled and viewed in multiple
dimensions. It is defined by dimensions and facts.
■ In general terms, dimensions are the perspectives or entities with
respect to which an organization wants to keep records.
■ For example, AllElectronics may create a sales data warehouse in
order to keep records of the store’s sales with respect to the
dimensions time, item, branch, and location.
■ These dimensions allow the store to keep track of things like
monthly sales of items and the branches and locations at which
the items were sold.
■ Each dimension may have a table associated with it, called a
dimension table, which further describes the dimension.
■ For example, a dimension table for item may contain the attributes
item name, brand, and type.
■ Cubes are generally viewed as 3-D geometric structures, but in data
warehousing the data cube is n-dimensional.
■ To gain a better understanding of data cubes and the
multidimensional data model, let’s start by looking at a simple 2-D
data cube that is, in fact, a table or spreadsheet for sales data from
AllElectronics.
■ In particular, we will look at the AllElectronics sales data for items sold
per quarter in the city of Vancouver.
■ These data are shown in Table 4.2.
■ In this 2-D representation, the sales for Vancouver are shown with
respect to the time dimension (organized in quarters) and the item
dimension (organized according to the types of items sold).
■ The fact or measure displayed is dollars sold (in thousands).
[Figure: part of the lattice of cuboids for a 4-D cube with dimensions time,
item, location, and supplier, from the 0-D (apex) cuboid "all" down to 3-D
cuboids such as (time, item, location), (time, item, supplier),
(item, location, supplier), and (time, location, supplier).]
4.2.2 Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
Example of Star Schema
■ Sales fact table: time_key, item_key, branch_key, location_key;
measures: units_sold, dollars_sold, avg_sales
■ Dimension tables:
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city, state_or_province, country
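The star schema can be written down as runnable SQL (here via SQLite). The column lists follow the figure; the column types and the sample row are assumptions:

```python
# A runnable sketch of the AllElectronics star schema in SQL, using SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, day_of_the_week TEXT,
                       month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       state_or_province TEXT, country TEXT);
-- The fact table in the middle references each dimension table.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    avg_sales    REAL
);
""")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 1, 10, 5000.0, 500.0)")
print(conn.execute("SELECT SUM(dollars_sold) FROM sales_fact").fetchone()[0])
```

Note how every dimension joins to the fact table through a single key, which is what makes the schema a star.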
Example of Snowflake Schema
■ Same sales fact table and measures as in the star schema, but with some
dimension hierarchies normalized into smaller tables:
■ item: item_key, item_name, brand, type, supplier_key;
supplier: supplier_key, supplier_type
■ location: location_key, street, city_key;
city: city_key, city, state_or_province, country
■ The time and branch dimension tables are unchanged
Example of Fact Constellation
■ A second fact table, the shipping fact table (time_key, item_key,
shipper_key, from_location, …), shares the time and item dimension
tables with the sales fact table of the star schema.
■ Many concept hierarchies are implicit within the database schema.
■ For example, suppose that the dimension location is described by
the attributes number, street, city, province or state, zip code, and
country.
■ These attributes are related by a total order, forming a concept
hierarchy such as “street < city < province or state < country.”
■ This hierarchy is shown in Figure 4.10(a).
■ Alternatively, the attributes of a dimension may be organized in a
partial order, forming a lattice.
■ An example of a partial order for the time dimension based on the attributes
day, week, month, quarter, and year is "day < {month < quarter;
week} < year."
■ This lattice structure is shown in Figure 4.10(b).
■ A concept hierarchy that is a total or partial order among attributes
in a database schema is called a schema hierarchy.
■ Concept hierarchies that are common to many applications (e.g.,
for time) may be predefined in the data mining system.
■ To see how measures are computed, it is first necessary to study how
measures can be categorized.
■ A data cube measure is a numeric function that can be evaluated
at each point in the data cube space.
■ A measure value is computed for a given point by aggregating the
data corresponding to the respective dimension–value pairs
defining the given point.
■ Measures can be organized into three categories—distributive,
algebraic, and holistic— based on the kind of aggregate functions
used.
■ Distributive: An aggregate function is distributive if the result derived
by applying the function to n aggregate values is the same as that
derived by applying the function on all the data without partitioning
■ E.g., count(), sum(), min(), max()
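The distributive property is easy to check directly, for example with sum() and max() over an arbitrary partitioning of the data:

```python
# Distributivity: applying the function to per-partition aggregates gives the
# same result as applying it to the full data set without partitioning.
data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[:3], data[3:6], data[6:]]

print(sum(data), sum(sum(p) for p in partitions))  # 31 31
print(max(data), max(max(p) for p in partitions))  # 9 9
```

This is why distributive measures can be rolled up cheaply: higher-level cells are computed from lower-level aggregates alone, without revisiting the base data.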
■ Algebraic: An aggregate function is algebraic if it can be computed by
an algebraic function with M arguments (where M is a bounded
integer), each of which is obtained by applying a distributive
aggregate function
■ E.g., avg(), min_N(), standard_deviation()
■ Holistic: An aggregate function is holistic if there is no constant bound
on the storage size needed to describe a subaggregate.
■ E.g., median(), mode(), rank()
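The distinction can be illustrated directly: avg() is recoverable exactly from the distributive pair (sum, count) kept per partition, whereas combining per-partition medians does not, in general, give the true median:

```python
import statistics

data = [1, 2, 3, 4, 100, 200]
parts = [data[:3], data[3:]]

# Algebraic: avg() from a bounded number of distributive aggregates per partition.
s = sum(sum(p) for p in parts)
n = sum(len(p) for p in parts)
print(s / n)  # equals the true average of the full data set

# Holistic: the median of partition medians is not the true median.
true_median = statistics.median(data)
median_of_medians = statistics.median(statistics.median(p) for p in parts)
print(true_median, median_of_medians)  # 3.5 51.0
```

Computing the exact median requires access to (a description of) all the data, which is precisely why no constant-size subaggregate suffices for holistic measures.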
A Sample Data Cube
[Figure: a 3-D data cube of sales with dimensions product (TV, PC, VCR),
time (quarters), and country (U.S.A., Canada, Mexico), with sum aggregates
along each dimension.]
Cuboids Corresponding to the Cube
[Figure: the lattice of cuboids for dimensions product, date, and country,
from the 0-D (apex) cuboid "all" through the 1-D cuboids (product), (date),
(country) and the 2-D cuboids down to the 3-D base cuboid.]
Typical OLAP Operations
■ Roll up (drill-up): summarize data
■ by climbing up hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
■ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
■ drill across: involving (across) more than one fact table
■ drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
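Roll-up and slice, for instance, map naturally onto SQL over a tiny sales table (the table layout, column names, and figures are assumptions for the example):

```python
# Roll-up and slice on a tiny sales cube, expressed as SQL over SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (quarter TEXT, month TEXT, item TEXT, city TEXT, dollars REAL);
INSERT INTO sales VALUES
 ('Q1','Jan','TV','Vancouver',100), ('Q1','Feb','TV','Vancouver',150),
 ('Q1','Jan','PC','Toronto',  200), ('Q2','Apr','TV','Vancouver', 50);
""")

# Roll-up: climb the time hierarchy from month up to quarter.
rollup = conn.execute(
    "SELECT quarter, SUM(dollars) FROM sales GROUP BY quarter ORDER BY quarter"
).fetchall()
print(rollup)  # [('Q1', 450.0), ('Q2', 50.0)]

# Slice: fix one value along a dimension (city = 'Vancouver').
vancouver = conn.execute(
    "SELECT item, SUM(dollars) FROM sales WHERE city='Vancouver' GROUP BY item"
).fetchall()
print(vancouver)  # [('TV', 300.0)]
```

Drill-down is the reverse direction (GROUP BY month instead of quarter), and a dice is simply a WHERE clause constraining two or more dimensions at once.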
A Star-Net Query Model
[Figure: a star-net model for customer orders. Radial lines represent
dimensions, with abstraction levels marked as circles, each called a
footprint: Customer (ORDER, CONTRACTS), Shipping Method (TRUCK, AIR-EXPRESS),
Time (DAILY, QTRLY, ANNUALLY), Product (PRODUCT ITEM, PRODUCT GROUP,
PRODUCT LINE), Location (CITY, COUNTRY), Organization (SALES PERSON,
DISTRICT, REGION, DIVISION), and Promotion.]
4.3 Design of Data Warehouse: A Business
Analysis Framework
■ Four views regarding the design of a data warehouse
■ Top-down view
■ allows selection of the relevant information necessary for the
data warehouse
■ Data source view
■ exposes the information being captured, stored, and
managed by operational systems
■ Data warehouse view
■ consists of fact tables and dimension tables
■ Business query view
■ sees the perspectives of data in the warehouse from the view
of end-user
Data Warehouse Design Process
■ Top-down, bottom-up approaches or a combination of both
■ Top-down: Starts with overall design and planning (mature)
■ Bottom-up: Starts with experiments and prototypes (rapid)
■ From software engineering point of view
■ Waterfall: structured and systematic analysis at each step before
proceeding to the next
■ Spiral: rapid generation of increasingly functional systems with
short turnaround time
■ Typical data warehouse design process
■ Choose a business process to model, e.g., orders, invoices, etc.
■ Choose the grain (atomic level of data) of the business process
■ Choose the dimensions that will apply to each fact table record
■ Choose the measure that will populate each fact table record
Data Warehouse Development: A
Recommended Approach
[Figure: data marts and an enterprise data warehouse evolve into
distributed data marts and, ultimately, a multi-tier data warehouse.]
■ Data warehouses and data marts are used in a wide range of applications.
■ Business executives use the data in data warehouses and data
marts to perform data analysis and make strategic decisions.
■ In many firms, data warehouses are used as an integral part of a
plan-execute-assess "closed-loop" feedback system for enterprise management.
■ Data warehouses are used extensively in banking and financial services,
consumer goods and retail distribution sectors, and controlled
manufacturing such as demand-based production.
■ Initially, the data warehouse is mainly used for generating reports and
answering predefined queries.
■ Progressively, it is used to analyze summarized and detailed data,
where the results are presented in the form of reports and charts.
■ Later, it is used for strategic purposes, performing multidimensional
analysis and sophisticated slice-and-dice operations.
From On-Line Analytical Processing (OLAP)
to On Line Analytical Mining (OLAM)
■ The DM field has conducted research on various data types,
■ including relational data, data from data warehouses, transaction
data, time-series data, spatial data, text data, and flat files.
■ Multidimensional data mining (also known as exploratory
multidimensional data mining, online analytical mining, or OLAM)
integrates OLAP with data mining to uncover knowledge in
multidimensional databases.
■ Among the many different paradigms and architectures of DM
systems, OLAM is particularly important for the following reasons:
■ High quality of data in data warehouses
■ Most DM tools need to work on integrated, consistent, and
cleaned data, which requires costly data cleaning, data
integration, and data transformation as pre-processing steps.
■ A DW constructed by such pre-processing serves as a valuable
source of high-quality data for OLAP as well as for data mining.
■ Notice that data mining may serve as a valuable tool for data
cleaning and data integration as well.
■ Available information-processing infrastructure surrounding data
warehouses
■ Comprehensive information-processing and data-analysis
infrastructures have been or will be systematically constructed
surrounding data warehouses.
■ A data cube is a lattice of cuboids.
■ Ex : Create a data cube for AllElectronics sales that contains the following:
city, item, year, and sales in dollars.
■ You want to be able to analyze the data, with queries such as the following:
■ “Compute the sum of sales, grouping by city and item.”
■ “Compute the sum of sales, grouping by city.”
■ “Compute the sum of sales, grouping by item.”
■ What is the total number of cuboids, or group-by’s, that can be computed
for this data cube?
■ Taking the three attributes, city, item, and year, as the dimensions for the
data cube, and sales in dollars as the measure, the total number of cuboids,
or group-by's, that can be computed for this data cube is 2^3 = 8.
■ The possible group-by’s are the following: {(city, item, year), (city, item),
(city, year), (item, year), (city), (item), (year), ()} where () means that the
group-by is empty (i.e., the dimensions are not grouped).
■ These group-by’s form a lattice of cuboids for the data cube, as shown in
Figure 4.14.
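The 2^n group-by's can be enumerated directly:

```python
# Enumerating the 2^n group-by's (cuboids) of the lattice with itertools.
from itertools import combinations

dims = ("city", "item", "year")
cuboids = [combo for r in range(len(dims) + 1) for combo in combinations(dims, r)]
print(len(cuboids))   # 8
print(cuboids[0])     # () -- the apex cuboid (empty group-by)
print(cuboids[-1])    # ('city', 'item', 'year') -- the base cuboid
```

The same enumeration for n dimensions yields 2^n cuboids, which is why full cube materialization quickly becomes expensive as dimensionality grows.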
Indexing OLAP Data: Bitmap Index
■ The bitmap indexing method is advantageous compared to hash and tree
indices.
■ In a bitmap index, each value v of the indexed attribute has its own bit
vector.
■ It is especially useful for low-cardinality domains because
comparison, join, and aggregation operations are then reduced to
bit arithmetic, which substantially reduces the processing time.
■ It is not suitable for high-cardinality domains.
■ The length of each bit vector: # of records in the base table
■ The i-th bit of the vector for value v is set if the i-th row of the base
table has value v in the indexed column.
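The bit-arithmetic idea can be sketched in a few lines of Python, using ints as bit vectors; the column data and value names are made up for the example:

```python
# A minimal bitmap index for a low-cardinality column, using Python ints as
# bit vectors: one vector per distinct value, bit i set when row i holds it.
def build_bitmap_index(column):
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index

region = ["asia", "europe", "asia", "america", "europe", "asia"]
idx = build_bitmap_index(region)

# Selection "region = 'asia'" becomes reading one bit vector:
print(bin(idx["asia"]))  # 0b100101 (rows 0, 2, and 5)
# "asia OR europe" reduces to a bitwise OR; counting rows is a popcount:
print(bin(idx["asia"] | idx["europe"]).count("1"))  # 5
```

With a high-cardinality column the index would need one vector per distinct value, which is exactly why the method stops paying off there.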
Indexing OLAP Data: Join Indices
■ The join indexing method gained popularity from its use in relational
database query processing.
■ Traditional indexing maps the value in a given column to a list of rows
having that value.
■ In contrast, join indexing registers the joinable rows of two relations
from a relational database.
■ For example, if two relations R(RID, A) and S(B, SID) join on the
attributes A and B, then the join index record contains the pair (RID,
SID), where RID and SID are record identifiers from the R and S
relations, respectively.
■ Hence, the join index records can identify joinable tuples without
performing costly join operations.
■ Join indexing is especially useful for maintaining the relationship between
a foreign key and its matching primary key from the joinable relation.
■ The star schema model of data warehouses makes join indexing
attractive for crosstable search.
■ Join indices may span multiple dimensions to form composite join
indices.
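A join index of the kind described, precomputed (RID, SID) pairs for joinable tuples, can be sketched over toy R and S relations (the data and the helper name are illustrative assumptions):

```python
# A sketch of a join index over R(RID, A) and S(B, SID): precomputed
# (RID, SID) pairs for the rows where R.A = S.B, so joinable tuples can be
# identified later without performing the join.
R = [("r1", "sony"), ("r2", "lg"), ("r3", "sony")]       # (RID, A)
S = [("sony", "s1"), ("samsung", "s2"), ("sony", "s3")]  # (B, SID)

def build_join_index(R, S):
    by_value = {}
    for b, sid in S:
        by_value.setdefault(b, []).append(sid)
    return [(rid, sid) for rid, a in R for sid in by_value.get(a, [])]

idx = build_join_index(R, S)
print(idx)  # [('r1', 's1'), ('r1', 's3'), ('r3', 's1'), ('r3', 's3')]
```

In a star schema the same structure links each fact-table row (via its foreign keys) to the matching dimension-table rows, which is what makes cross-table search cheap.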
Efficient Processing of OLAP Queries
OLAP Server Architectures
■ The physical architecture and implementation of OLAP servers must
consider data storage issues.
■ Implementations of a warehouse server for OLAP processing include the
following:
■ Relational OLAP (ROLAP)
■ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware
■ Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
■ Greater scalability
■ Multidimensional OLAP (MOLAP)
■ Sparse array-based multidimensional storage engine
■ Fast indexing to pre-computed summarized data
■ Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
■ Flexibility, e.g., low level: relational, high-level: array
■ Specialized SQL servers (e.g., Redbricks)
■ Specialized support for SQL queries over star/snowflake schemas
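The sparse-array idea behind MOLAP can be sketched under the simplifying assumption that a Python dict keyed by coordinate tuples stands in for the storage engine:

```python
# A sketch of MOLAP's sparse multidimensional storage: only nonempty cells
# of the cube are stored, keyed by their coordinate tuple.
cube = {}  # (time, item, location) -> dollars_sold

def store(cell, value):
    """Accumulate a measure value into one cell of the sparse cube."""
    cube[cell] = cube.get(cell, 0.0) + value

store(("Q1", "TV", "Vancouver"), 100.0)
store(("Q1", "PC", "Toronto"), 200.0)
store(("Q1", "TV", "Vancouver"), 50.0)

# Direct cell lookup is a single probe, not a table scan:
print(cube[("Q1", "TV", "Vancouver")])  # 150.0
# A roll-up over one dimension sums the matching cells:
print(sum(v for (t, i, l), v in cube.items() if i == "TV"))  # 150.0
```

A ROLAP server would instead answer both queries with SQL over relational tables; HOLAP mixes the two, e.g. detailed data in relations and summaries in arrays.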