Data Warehousing and Mining
• Overview
• Data Warehousing
• Online Analytical Processing
• Data Mining
Overview
• Data analytics: the processing of data to
infer patterns, correlations, or models for
prediction
• Primarily used to make business decisions
• Per individual customer
• E.g., what product to suggest for purchase
• Across all customers
• E.g., what products to manufacture/stock, in what
quantity
• Critical for businesses today
Overview (Cont.)
• Common steps in data analytics
• Gather data from multiple sources into one location
• Data warehouses also integrate data into a common
schema
• Data often needs to be extracted from source formats,
transformed to common schema, and loaded into the
data warehouse
• Can be done as ETL (extract-transform-load), or ELT
(extract-load-transform)
• Generate aggregates and reports summarizing
data
• Dashboards showing graphical charts/reports
Overview (Cont.)
• Online analytical processing (OLAP) systems
allow interactive querying
• Statistical analysis using tools such as R/SAS/SPSS
• Including extensions for parallel processing of big data
• Build predictive models and use the models for
decision making
Overview (Cont.)
• Predictive models are widely used today
• E.g., use customer profile features (e.g. income,
age, gender, education, employment) and past
history of a customer to predict likelihood of
default on loan
• and use prediction to make loan decision
• E.g., use past history of sales (by season) to
predict future sales
• And use it to decide what/how much to produce/stock
• And to target customers
• Typically
• fact table joined with dimension tables and then
• group-by on dimension table attributes, and then
• aggregation on measure attributes of fact table
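The query pattern above (join the fact table with dimension tables, group by dimension attributes, aggregate a measure) can be sketched in a few lines of Python. The tables, attribute names, and values here are made up for illustration; they are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical fact table rows: (item_id, store_id, quantity).
# quantity is the measure attribute.
sales_fact = [
    (1, 10, 2),
    (2, 10, 1),
    (1, 20, 4),
    (3, 20, 5),
]

# Hypothetical dimension tables, keyed by their id attribute.
item_info = {
    1: {"item_name": "skirt", "category": "womenswear"},
    2: {"item_name": "dress", "category": "womenswear"},
    3: {"item_name": "shirt", "category": "menswear"},
}
store = {10: {"city": "Pune"}, 20: {"city": "Delhi"}}


def quantity_by_category_and_city():
    totals = defaultdict(int)
    for item_id, store_id, quantity in sales_fact:
        # Join: look up the dimension rows referenced by this fact row.
        category = item_info[item_id]["category"]
        city = store[store_id]["city"]
        # Group by dimension attributes and aggregate the measure.
        totals[(category, city)] += quantity
    return dict(totals)


result = quantity_by_category_and_city()
```

With this sample data, `result` maps each (category, city) pair to its total quantity.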
Multidimensional Data and
Warehouse Schemas (Cont.)
• Some applications do not find it worthwhile
to bring data to a common schema
• Data lakes are repositories which allow data
to be stored in multiple formats, without
schema integration
• Less upfront effort, but more effort during
querying
Database Support for Data
Warehouses
• Data in warehouses is usually append-only, not
updated
• Can avoid concurrency control overheads
• Data warehouses often use column-oriented
storage
• E.g., a sequence of sales tuples is stored as follows
• Values of item_id attribute are stored as an array
• Values of store_id attribute are stored as an array,
• And so on
• Arrays are compressed, reducing storage, IO and
memory costs significantly
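A minimal sketch of the column-oriented layout described above. The slides do not say which compression scheme is used; run-length encoding is assumed here only because it is easy to show, and the sample tuples are invented.

```python
from itertools import groupby

# Hypothetical sales tuples: (item_id, store_id, quantity).
rows = [(1, 10, 2), (1, 10, 1), (1, 20, 4), (2, 20, 5)]


def to_columns(rows, names):
    """Store each attribute's values as its own array."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Compress a column into (value, run_length) pairs (assumed scheme)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows, ["item_id", "store_id", "quantity"])
encoded_item_id = run_length_encode(columns["item_id"])
encoded_store_id = run_length_encode(columns["store_id"])
```

A query that only needs `quantity` reads just `columns["quantity"]` and never touches the other arrays.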
Database Support for Data
Warehouses (Cont.)
• Queries can fetch only attributes that they care
about, reducing IO and memory cost
• More details in Section 13.6
• Data warehouses often use parallel storage
and query processing infrastructure
• Distributed file systems, Map-Reduce, Hive, …
OLAP
Data Analysis and OLAP
• Online Analytical Processing (OLAP)
• Interactive analysis of data, allowing data to
be summarized and viewed in different ways
in an online fashion (with negligible delay)
• We use the following relation to illustrate
OLAP concepts
• sales (item_name, color, clothes_size,
quantity)
This is a simplified version of the sales fact table
joined with the dimension tables, with many
attributes removed (and some renamed)
Example sales relation
month or year
• Drill down: The opposite operation - that of moving from
coarser-granularity data to finer-granularity data
Hierarchies on Dimensions
• Hierarchy on dimension attributes: lets
dimensions be viewed at different levels of
detail
• E.g., the dimension datetime can be used to
aggregate by hour of day, date, day of week,
month, quarter or year
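As a rough illustration of viewing one dimension at several levels of detail, the sketch below aggregates the same rows by different levels of a datetime hierarchy. The `LEVELS` table, the `aggregate_by` helper, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Each level maps a datetime to the value used for grouping at that level.
LEVELS = {
    "hour": lambda d: d.hour,
    "date": lambda d: d.date(),
    "month": lambda d: (d.year, d.month),
    "quarter": lambda d: (d.year, (d.month - 1) // 3 + 1),
    "year": lambda d: d.year,
}

# Hypothetical (datetime, quantity) rows.
rows = [
    (datetime(2023, 1, 15, 9), 2),
    (datetime(2023, 2, 3, 14), 3),
    (datetime(2023, 4, 20, 9), 5),
    (datetime(2024, 1, 5, 10), 1),
]


def aggregate_by(level, rows):
    """Sum quantity grouped by the chosen level of the hierarchy."""
    totals = defaultdict(int)
    for when, quantity in rows:
        totals[LEVELS[level](when)] += quantity
    return dict(totals)


by_quarter = aggregate_by("quarter", rows)
by_year = aggregate_by("year", rows)
```

Moving from `by_quarter` to `by_year` corresponds to a roll up; going the other way is a drill down.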
Cross Tabulation With
Hierarchy
• Cross-tabs can be easily extended to deal
with hierarchies
• Can drill down or roll up on a hierarchy
• E.g. hierarchy: item_name → category
Relational Representation of
Cross-tabs
• Cross-tabs can be
represented as relations
• We use the value all to
represent aggregates.
• The SQL standard
actually uses null values
in place of all
• Works with any data type
• But can cause confusion
with regular null values.
OLAP IN SQL
Pivot Operation
• select *
from sales
pivot (
sum(quantity)
for color in ('dark','pastel','white')
)
order by item_name;
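The effect of the pivot query can be approximated in plain Python. This is a sketch of the idea rather than of any particular database's pivot implementation: the remaining attributes (item_name, clothes_size) identify each output row, and each listed color becomes a column holding sum(quantity). The sample rows are invented.

```python
from collections import defaultdict

# Hypothetical sales rows: (item_name, color, clothes_size, quantity).
sales = [
    ("dress", "dark", "small", 2),
    ("dress", "dark", "medium", 6),
    ("dress", "pastel", "small", 4),
    ("skirt", "white", "small", 5),
    ("skirt", "dark", "small", 2),
    ("dress", "dark", "small", 1),
]


def pivot_on_color(rows, colors=("dark", "pastel", "white")):
    # One output row per (item_name, clothes_size); None plays the role of null.
    table = defaultdict(lambda: dict.fromkeys(colors))
    for item_name, color, clothes_size, quantity in rows:
        cell = table[(item_name, clothes_size)]
        if color in colors:
            cell[color] = (cell[color] or 0) + quantity
    # Sorting by key stands in for "order by item_name".
    return dict(sorted(table.items()))


pivoted = pivot_on_color(sales)
```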
Cube Operation
• The cube operation computes union of group by’s on every
subset of the specified attributes
• E.g., consider the query
select item_name, color, clothes_size, sum(quantity)
from sales
group by cube(item_name, color, clothes_size)
This computes the union of eight different groupings of the
sales relation:
{ (item_name, color, clothes_size), (item_name, color),
(item_name, clothes_size), (color, clothes_size),
(item_name), (color),
(clothes_size), ()}
where ( ) denotes an empty group by list.
• For each grouping, the result contains the null value for
attributes not present in the grouping.
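To make the "union of group by's on every subset" concrete, here is a small Python sketch that enumerates the groupings and computes sum(quantity) for each. `None` stands in for the null value, and the sample rows are invented. It assumes the data itself contains no nulls; otherwise keys from different groupings could collide.

```python
from collections import defaultdict
from itertools import combinations

ATTRS = ("item_name", "color", "clothes_size")

# Hypothetical sales rows: (item_name, color, clothes_size, quantity).
sales = [
    ("dress", "dark", "small", 2),
    ("dress", "dark", "medium", 6),
    ("dress", "pastel", "small", 4),
    ("skirt", "white", "small", 5),
    ("skirt", "dark", "small", 2),
]


def cube_groupings(attrs):
    """Every subset of attrs, largest first: 2^n groupings for n attributes."""
    return [subset
            for k in range(len(attrs), -1, -1)
            for subset in combinations(attrs, k)]


def cube(rows, attrs=ATTRS):
    result = {}
    for grouping in cube_groupings(attrs):
        totals = defaultdict(int)
        for row in rows:
            values = dict(zip(attrs, row))  # zip stops before the quantity
            # Attributes not in the grouping are reported as None (null).
            key = tuple(values[a] if a in grouping else None for a in attrs)
            totals[key] += row[-1]
        result.update(totals)
    return result


cube_result = cube(sales)
```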
Online Analytical Processing Operations
• Relational representation of cross-tab that we saw earlier, but
with null in place of all, can be computed by
select item_name, color, sum(quantity)
from sales
group by cube(item_name, color)
• The function grouping() can be applied on an attribute
• Returns 1 if the value is a null value representing all,
and returns 0 in all other cases.
select case when grouping(item_name) = 1 then 'all'
else item_name end as item_name,
case when grouping(color) = 1 then 'all'
else color end as color,
'all' as clothes_size, sum(quantity) as quantity
from sales
group by cube(item_name, color);
Online Analytical Processing
Operations
• Can use the function decode() in the select
clause to replace such nulls by
a value such as all
• E.g., replace item_name in first query by
decode( grouping(item_name), 1, 'all',
item_name)
Extended Aggregation (Cont.)
• The rollup construct generates union on every prefix of specified
list of attributes
• select item_name, color, clothes_size, sum(quantity)
from sales
group by rollup(item_name, color, clothes_size)
Generates union of four groupings:
{ (item_name, color, clothes_size), (item_name, color), (item_name),
()}
● Rollup can be used to generate aggregates at multiple levels of a
hierarchy.
● E.g., suppose table itemcategory(item_name, category) gives the
category of each item. Then
select category, sales.item_name, sum(quantity)
from sales, itemcategory
where sales.item_name = itemcategory.item_name
group by rollup(category, sales.item_name)
would give a hierarchical summary by item_name and by
category.
Extended Aggregation (Cont.)
• Multiple rollups and cubes can be used in a single group by clause
• Each generates set of group by lists, cross product of sets gives
overall set of group by lists
• E.g.,
select item_name, color, clothes_size, sum(quantity)
from sales
group by rollup(item_name), rollup(color, clothes_size)
generates the groupings
{(item_name), ()} X {(color, clothes_size), (color), ()}
= { (item_name, color, clothes_size), (item_name, color),
(item_name),
(color, clothes_size), (color), ( ) }
• select item_name, color, clothes_size, sum(quantity)
from sales
group by grouping sets ((color, clothes_size),
(clothes_size, item_name));
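The sets of group by lists produced by rollup, and by combining several rollups in one group by clause, can be enumerated with a short sketch. The helper names `rollup` and `combine` are made up for this example; they only list the groupings and do not evaluate any query.

```python
from itertools import product


def rollup(*attrs):
    """Groupings for rollup(attrs): every prefix of the list, longest first."""
    return [tuple(attrs[:k]) for k in range(len(attrs), -1, -1)]


def combine(*grouping_sets):
    """Cross product of several sets of groupings.

    Each combination is concatenated into a single group by list.
    """
    return [sum(parts, ()) for parts in product(*grouping_sets)]


single = rollup("item_name", "color", "clothes_size")
combined = combine(rollup("item_name"), rollup("color", "clothes_size"))
```

Here `single` holds the four prefix groupings, and `combined` holds the six groupings obtained from the cross product of the two rollups.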
OLAP Implementation
• The earliest OLAP systems used
multidimensional arrays in memory to store
data cubes, and are referred to as
multidimensional OLAP (MOLAP) systems.
• OLAP implementations using only relational
database features are called relational OLAP
(ROLAP) systems
• Hybrid systems, which store some summaries
in memory and store the base data and other
summaries in a relational database, are called
hybrid OLAP (HOLAP) systems.
OLAP Implementation (Cont.)
• Early OLAP systems precomputed all possible aggregates in order
to provide online response
• Space and time requirements for doing so can be very high
• 2^n combinations of group by
• It suffices to precompute some aggregates, and compute
others on demand from one of the precomputed aggregates
• Can compute aggregate on (item_name, color) from an
aggregate on (item_name, color, clothes_size)
• Possible for all but a few “non-decomposable” aggregates such
as median
• Cheaper than computing it from scratch
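A small sketch of computing a coarser aggregate from a precomputed finer one, and of why median does not decompose the same way. The precomputed sums and the raw quantities are invented sample values.

```python
from collections import defaultdict
from statistics import median

# Hypothetical precomputed aggregate:
# sum(quantity) grouped by (item_name, color, clothes_size).
fine_sums = {
    ("dress", "dark", "small"): 2,
    ("dress", "dark", "medium"): 6,
    ("dress", "pastel", "small"): 4,
    ("skirt", "dark", "small"): 2,
}


def roll_up_sum(fine, keep=2):
    """Sum on the first `keep` attributes, computed from the finer sums."""
    coarse = defaultdict(int)
    for key, total in fine.items():
        coarse[key[:keep]] += total
    return dict(coarse)


coarse_sums = roll_up_sum(fine_sums)

# Median is non-decomposable: combining per-group medians does not, in
# general, give the median of the underlying values.
small_quantities = [1, 1, 10]   # hypothetical raw values in one fine group
medium_quantities = [5, 6]      # hypothetical raw values in another
median_of_medians = median([median(small_quantities), median(medium_quantities)])
true_median = median(small_quantities + medium_quantities)
```

The rolled-up sums only touch four precomputed values instead of rescanning the base data, while `median_of_medians` and `true_median` differ for this sample.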
⇒ P.credit =
excellent
• ∀ person P, P.degree = bachelors and
at a time
• Output is computed using current
weights
• If classification is wrong, weights
are tweaked to get a higher score
for the correct class
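The idea of tweaking weights when a classification is wrong can be illustrated with a multi-class perceptron update. This is a deliberately simplified stand-in, not the training procedure used for real neural networks (which typically rely on backpropagation), and the function names, learning rate, and data are all assumptions.

```python
def scores_for(weights, features):
    """One score per class, computed from the current weights."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def predict(weights, features):
    scores = scores_for(weights, features)
    return max(range(len(scores)), key=scores.__getitem__)


def train(instances, num_classes, num_features, epochs=10, rate=1.0):
    weights = [[0.0] * num_features for _ in range(num_classes)]
    for _ in range(epochs):
        for features, label in instances:  # one training instance at a time
            predicted = predict(weights, features)
            if predicted != label:
                # Raise the correct class's score and lower the wrong one's.
                for i, x in enumerate(features):
                    weights[label][i] += rate * x
                    weights[predicted][i] -= rate * x
    return weights


# Hypothetical, linearly separable training instances: (features, class).
training = [
    ((1.0, 0.0), 0),
    ((0.0, 1.0), 1),
    ((2.0, 0.5), 0),
    ((0.5, 2.0), 1),
]
trained = train(training, num_classes=2, num_features=2)
```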
Neural Networks (Cont.)
• Deep neural networks have a large number of
layers with a large number of nodes in each layer
• Deep learning refers to training of deep neural
networks on very large numbers of training
instances
• Each layer may be connected to previous layers in
different ways
• Convolutional networks are used for image
processing
• More complex architectures are used for text
processing, machine translation, speech
recognition, etc.
Regression
• Regression deals with the prediction of a value, rather than a
class.
• Given values for a set of variables, X1, X2, …, Xn, we wish
to predict the value of a variable Y.
• One way is to infer coefficients a0, a1, a2, …, an such that
Y = a0 + a1 * X1 + a2 * X2 + … + an * Xn
• Finding such a linear polynomial is called linear regression.
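For the simplest case of a single variable (n = 1), the coefficients can be found with the standard least-squares formulas. The sketch below assumes that case only; the function name and data points are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a0 + a1 * X1 for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx             # slope
    a0 = mean_y - a1 * mean_x  # intercept
    return a0, a1


# Hypothetical points that lie exactly on Y = 1 + 2 * X1.
a0, a1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_y = a0 + a1 * 5
```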
• In general, the process of finding a curve that fits the data
(Figure: part of a classification hierarchy, with chordata above mammalia and reptilia)
• Information-gain ratio =
Information-gain (S, {S1, S2, ……, Sr})
pass
• If memory is not enough to hold all counts for all itemsets, use
multiple passes, considering only some itemsets in each pass.
• Optimization: Once an itemset is eliminated because its count
(support) is too small, none of its supersets needs to be
considered.
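The pruning optimization can be sketched as a level-by-level search: an itemset of size k is counted only if every one of its subsets of size k-1 survived the previous pass. This is a simplified in-memory version under assumed names and data, with support given as an absolute count.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # One pass over the data: count support of each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        size += 1
        # Build larger candidates from survivors, then prune any candidate
        # that has an eliminated subset.
        unions = {a | b for a in survivors for b in survivors if len(a | b) == size}
        candidates = [
            c for c in unions
            if all(frozenset(s) in survivors for s in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "cereal", "milk"},
    {"bread", "cereal"},
    {"milk"},
]
found = frequent_itemsets(baskets, min_count=2)
```

In this sample, {milk, cereal} appears only once, so {bread, milk, cereal} is never counted.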
• The a priori technique to find large itemsets:
• Pass 1: count support of all sets with just 1 item. Eliminate