Data Warehousing & Modeling
Subject Code: 18CS641
Module-2:
Data warehouse implementation & Data mining: Efficient Data Cube
Computation: An Overview, Indexing OLAP Data: Bitmap Index and Join
Index, Efficient Processing of OLAP Queries, OLAP Server Architectures:
ROLAP versus MOLAP versus HOLAP. Introduction: What is Data Mining,
Challenges, Data Mining Tasks, Data: Types of Data, Data Quality, Data
Preprocessing, Measures of Similarity and Dissimilarity.
Textbook 2: Ch.4.4
Textbook 1: Ch.1.1,1.2,1.4, 2.1 to 2.4
Data Warehouse Implementation
🞂Data warehouses contain huge amounts of data.
🞂OLAP servers must answer decision support queries in the order of
seconds.
🞂So it is crucial for data warehouse systems to support highly efficient
cube computation techniques, access methods, and query processing
techniques.
Efficient Data Cube Computation: An
Overview
🞂 In multidimensional data analysis, aggregations need to be computed
efficiently over many sets of dimensions.
🞂 In SQL terms, these aggregations are referred to as group-by's.
🞂 Each group-by can be represented by a cuboid.
🞂 The set of group-by's forms a lattice of cuboids defining a data cube.
🞂 Issues related to the efficient computation of data cubes are as
follows:
Efficient Data Cube Computation: “The compute cube” Operator
◦ For the dimensions (city, item, year), compute cube generates the
following group-by's (cuboids):
◦ {(city, item, year),
◦ (city, item), (city, year), (item, year),
◦ (city), (item), (year),
◦ ()}
● (city, item, year) is the least generalized (most specific) cuboid;
() is the most generalized (apex) cuboid.
● Rolling up moves from more specific cuboids toward more general ones.
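To make the lattice concrete, here is a minimal Python sketch (not from the textbook) that enumerates all 2^n group-by's (cuboids) for the three dimensions above; the names city, item, and year come from the example, everything else is illustrative.

# Enumerate all 2^n cuboids (group-by sets) of a cube on (city, item, year).
from itertools import combinations

dims = ("city", "item", "year")
cuboids = [combo
           for r in range(len(dims), -1, -1)   # from the base cuboid down to the apex
           for combo in combinations(dims, r)]
for c in cuboids:
    print(c if c else "()")                    # () is the apex ("all") cuboid
# Prints 2^3 = 8 cuboids: (city, item, year), ..., (city), (item), (year), ()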
Efficient Data Cube Computation: “The compute cube” Operator
Example 2-D (city, item) cuboid with “All” aggregates (the item names
were not recoverable from the original figure):

City       | item1 item2 item3 item4 item5 item6 | All
New York   |   10    11    12     3    10     1  |  47
Chicago    |   11     9     6     9     6     7  |  48
Toronto    |   12     9     8     5     7     3  |  44
Vancouver  |   13     8    10     5     6     3  |  45
All        |   46    37    36    22    29    14  | 184

The “All” row is the (item) cuboid, the “All” column is the (city)
cuboid, and the grand total 184 is the apex cuboid.
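As a rough sketch (using the hypothetical column labels item1–item6 above), the “All” row, “All” column, and grand total of this 2-D cuboid can be obtained by rolling up the base values in Python:

# Roll up the 2-D (city, item) cuboid to the (city) cuboid, (item) cuboid, and apex.
sales = {
    "New York":  [10, 11, 12, 3, 10, 1],
    "Chicago":   [11,  9,  6, 9,  6, 7],
    "Toronto":   [12,  9,  8, 5,  7, 3],
    "Vancouver": [13,  8, 10, 5,  6, 3],
}
city_cuboid = {city: sum(v) for city, v in sales.items()}    # "All" column: 47, 48, 44, 45
item_cuboid = [sum(col) for col in zip(*sales.values())]     # "All" row: 46, 37, 36, 22, 29, 14
apex = sum(city_cuboid.values())                             # grand total: 184
print(city_cuboid, item_cuboid, apex)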
Indexing OLAP Data
🞂Bitmap Index
🞂Join Index
Indexing OLAP Data: Bitmap Index
🞂 The bitmap index is an alternative representation of the record ID
(RID) list.
🞂 If the attribute has the value v for a given row in the data table,
then the bit representing that value is set to 1 in the corresponding
row of the bitmap index. All other bits for that row are set to 0.
Bitmap Index Advantages
🞂It is efficient compared to hash and tree indices, especially for
low-cardinality domains, because comparison, join, and aggregation
operations then reduce to bit arithmetic.
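The following is a small illustrative Python sketch (made-up city values, not a real OLAP engine) of how a bitmap index represents an attribute with one bit vector per value:

# Build a bitmap index for a low-cardinality attribute (here, "city").
rows = ["Chicago", "Toronto", "Chicago", "Vancouver", "Toronto"]   # base table column, ordered by RID
bitmap = {}
for rid, value in enumerate(rows):
    bits = bitmap.setdefault(value, [0] * len(rows))
    bits[rid] = 1                                   # set the bit for the row where the value occurs
print(bitmap["Toronto"])                            # [0, 1, 0, 0, 1]
# A selection such as city = 'Toronto' AND <another predicate> becomes a bitwise AND of bit vectors.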
Example:
🞂 Suppose that we define a data cube for AllElectronics of the form
“define cube sales_cube [time, item, location]: sum(sales_in_dollars)”.
🞂 The dimension hierarchies used are:
🞂“day < month < quarter < year” for time
🞂“item_name < brand < type” for item
🞂“street < city < province_or_state < country” for location.
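Below is a minimal sketch (illustrative only; the helper name roll_up_time is hypothetical) of rolling a day-level time value up the “day < month < quarter < year” hierarchy listed above:

# Map a day-level value to a coarser level of the time dimension hierarchy.
from datetime import date

def roll_up_time(d, level):
    if level == "month":
        return (d.year, d.month)
    if level == "quarter":
        return (d.year, (d.month - 1) // 3 + 1)
    if level == "year":
        return d.year
    return d                                        # "day": no roll-up

print(roll_up_time(date(2010, 5, 17), "quarter"))   # (2010, 2)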
Efficient Processing of OLAP Queries
🞂 The data cube contains all the possible answers to a given range of
questions.
Preprocessing
🞂It includes:
🞂Fusing data from multiple sources
🞂Cleaning data to remove noise and duplicate observations
🞂Selecting records and features that are relevant to the data mining
task at hand
Post processing
🞂Ensures that only valid and useful results are
incorporated into the Decision Support System(DSS).
[Figure: the four core data mining tasks applied to data — Predictive
Modeling, Clustering, Association Rules, and Anomaly Detection.]
1. Predictive Modeling
🞂 Refers to the task of building a model for the target variable as a
function of the explanatory variables.
Classification
🞂 Given a collection of records (training set), each record contains a
set of attributes, one of which is the class.
🞂 Usually, the given data set is divided into training and test sets,
with the training set used to build the model and the test set used to
validate it.
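A minimal scikit-learn sketch of this train/validate workflow is shown below; it uses the library's bundled iris data and a decision tree purely for illustration, not the course's own example:

# Split data into training and test sets, learn a classifier, validate on the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)    # build the model on the training set
print(accuracy_score(y_test, model.predict(X_test)))      # validate it on the held-out test set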
Classification Example
[Figure: a training set whose records have categorical, quantitative, and
class attributes is used to learn a classifier (model); the model is then
applied to a test set.]
2. Cluster Analysis
🞂 Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different from
(or unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized, while inter-cluster
distances are maximized.]
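A small illustrative sketch (made-up 2-D points, k-means from scikit-learn; not the document-clustering example that follows) of grouping objects so that each group is internally similar:

# Group objects so that intra-cluster distances are small and inter-cluster distances are large.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],     # one tight group
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])    # another tight group
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)                                               # e.g. [0 0 0 1 1 1]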
Example: Document Clustering
3. Association Analysis
Rules Discovered:
{Milk} --> {Bread}
{Diapers} --> {Milk}
{Diapers, Milk} --> {Coke}
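As a rough sketch of where such rules come from, the Python snippet below computes the support and confidence of {Milk} --> {Bread} on a few made-up market-basket transactions (the baskets are illustrative, not data from the slides):

# Support and confidence of the rule {Milk} -> {Bread}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]
lhs, rhs = {"Milk"}, {"Bread"}
both = sum((lhs | rhs) <= t for t in transactions)          # baskets containing Milk and Bread
support = both / len(transactions)                          # 3/5 = 0.6
confidence = both / sum(lhs <= t for t in transactions)     # 3/4 = 0.75
print(support, confidence)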
Examples:
🞂 Finding groups of genes that have related functionality
🞂 Identifying Web pages that are accessed together
🞂 Understanding the relationships between different elements of Earth's
climate system.
4.Deviation/Anomaly/Change Detection
🞂 Task of identifying observations
whose characteristics are
significantly different from the
rest of the data.
Data: Objects and Attributes
◦ An attribute is also known as a variable, field, characteristic,
dimension, or feature
🞂 A collection of attributes describes an object
◦ An object is also known as a record, point, case, event, vector,
pattern, observation, sample, entity, or instance
Data Set - Example
● Each row is an object (a student)
● Each column is an attribute (an aspect of a student)
● Record-based data sets are stored in flat files or relational database
systems
Attributes and Measurement
🞂 An attribute is a property or characteristic of an object that
may vary; either from one object to another or from one time
to another.
◦ Distinctness: =, ≠
◦ Order: <, >
◦ Differences are meaningful: +, −
◦ Ratios are meaningful: *, /
Types are:
◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & meaningful differences
◦ Ratio attribute: all 4 properties/operations
Example: length (a ratio attribute)
Types of Attributes
🞂 There are different types of attributes
🞂Continuous Attribute
◦ Has real numbers as attribute values
◦ Examples: temperature, height, or weight.
◦ Practically, real values can only be measured and represented
using a finite number of digits.
◦ Continuous attributes are typically represented as floating-point
variables.
Asymmetric Attributes
🞂Only presence (a non-zero attribute value) is regarded as important.
◦ Distribution
◦ The frequency of occurrence of the various values of an attribute
across the data objects
◦ Statisticians have enumerated many common distributions (e.g.,
Gaussian/normal) and their properties
◦ Many data sets are not well captured by such standard distributions,
so their statistical distribution is often not analyzed
◦ Skewness in the distribution makes classification difficult, e.g., a
categorical attribute with 95% "Yes" and 5% "No" values
Important Characteristics of Data Sets
◦ Sparsity
● Only the non-zero values need to be stored and manipulated,
which improves computation time and storage (see the sketch after
this list).
◦ Resolution
● It is possible to obtain data at different levels of resolution, and
the properties of the data are often different at different resolutions.
● Ex: The surface of the Earth is uneven at a resolution of a few
meters but relatively smooth at a resolution of tens of kilometers.
● If patterns are examined at too fine or too coarse a resolution,
they may not be visible.
● Ex: Atmospheric pressure at a scale of hours reflects the movement
of storms and other weather systems; on a scale of months, such
phenomena are not detectable.
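The sketch below (illustrative values only) shows the sparsity point from the list above: only non-zero entries are kept, so storage and computation scale with the number of non-zeros rather than with the full matrix size.

# Store only the non-zero entries of a sparse data set.
dense = [
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 2],
]
sparse = {(i, j): v                      # keep (row, column) -> value only where v != 0
          for i, row in enumerate(dense)
          for j, v in enumerate(row) if v != 0}
print(sparse)                            # {(0, 2): 3, (1, 1): 1, (2, 4): 2} — 3 entries instead of 15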
Types of data sets
🞂 Record Data
◦ Transaction Data (Market Basket Data)
◦ Data Matrix(Pattern Matrix)
◦ Sparse Data Matrix (Document-term Data Matrix)
🞂 Graph-Based Data
◦ World Wide Web (Data with Relationships among Objects)
◦ Molecular Structures (Data with Objects that are Graphs)
🞂 Ordered Data
◦ Sequential Transaction Data
◦ Genomic Sequence Data
◦ Temperature time series data
◦ Spatial Temperature Data
Record Data
🞂 Data that consists of a collection of records, each of which consists
of a fixed set of attributes
🞂 Stored in flat files or relational databases
Transaction Data
🞂A special type of record data, where
◦ Each record (transaction) involves a set of items.
◦ For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased
are the items.
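A minimal sketch (with made-up items) of how such transaction data can be held as item sets and flattened into a binary, asymmetric-attribute record table:

# Transaction (market-basket) data: each record is the set of items bought in one trip.
transactions = {
    "T1": {"Bread", "Milk"},
    "T2": {"Bread", "Diapers", "Beer"},
    "T3": {"Milk", "Diapers", "Coke"},
}
items = sorted(set().union(*transactions.values()))
# The same data as a binary record table: one row per transaction, one column per item.
table = {tid: [int(it in basket) for it in items] for tid, basket in transactions.items()}
print(items)
print(table["T1"])                       # 1s mark the items present in transaction T1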
Data Matrix
🞂 If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute
[Figures: examples of ordered data — sequential transaction data
(items/events), genomic sequence data, a time series of the average
monthly temperature of land and ocean, and spatio-temporal data.]
Handling Non-Record Data
🞂Most data mining algorithms are designed for record data or its
variations.
Data Quality
🞂Two approaches to data quality problems:
(1) The detection and correction of data quality problems ("data
cleaning").
(2) The use of robust algorithms that can tolerate poor data quality.
Measurement and Data Collection Issues
🞂 The term data collection error refers to errors such as omitting data
objects or attribute values, or inappropriately including a data object.
Example: in a study of animals of a certain species, animals of a
related, similar-looking species may be included by mistake.
🞂 Problems that involve measurement error are:
◦ Noise, artifacts, bias, precision, and accuracy
◦ Data quality issues that involve both measurement and data collection
problems are: outliers, missing values, inconsistent values, and
duplicate data
Noise
🞂 Noise is the random component of a measurement error.
🞂 It may involve the distortion of a value or the addition of
spurious objects.
🞂 The term noise is often used in connection with data that has a
spatial or temporal component, such as signal or image processing data.
Artifacts
🞂Data errors may be the result of a more deterministic
phenomenon, such as a streak in the same place on a
set of photographs.
🞂 Bias: A systematic variation of the measurements from the quantity
being measured.
Duplicate Data
🞂Examples:
◦ The same person with multiple email addresses
◦ Two distinct persons with the same name; deduplication should treat
such records as not similar (not duplicates)
Data Quality Issues Related to Applications
Data Preprocessing
🞂Aggregation
🞂Sampling
🞂Dimensionality reduction
🞂Feature subset selection
🞂Feature creation
🞂Discretization and Binarization
🞂Variable transformation
Aggregation
● Combining two or more attributes (or objects) into a
single attribute (or object)
● Purpose
– Data reduction
◆ Reduce the number of attributes or objects
– Change of scale
◆ Cities aggregated into regions, states, countries, etc.
◆ Days aggregated into weeks, months, or years
– More “stable” data
◆ Aggregated data (e.g., averages, totals) tends to have less
variability than the individual values being aggregated.
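The "more stable data" point can be checked with a short numpy sketch (randomly generated values, purely illustrative): the standard deviation shrinks as values are averaged over longer periods.

# Aggregated values vary less than the individual values being aggregated.
import numpy as np

rng = np.random.default_rng(0)
daily = rng.normal(loc=5.0, scale=2.0, size=(10, 365))      # 10 sites x 365 daily values
monthly_avg = daily[:, :360].reshape(10, 12, 30).mean(axis=2)
yearly_avg = daily.mean(axis=1)
print(daily.std(), monthly_avg.std(), yearly_avg.std())     # the spread drops at each level of aggregation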
Aggregation
[Figure: histograms of the standard deviation of average monthly
precipitation and of average yearly precipitation for a set of locations;
the aggregated (yearly) values show less variability.]
🞂Redundant features
◦ Duplicate much or all of the information contained in one or more
other attributes
◦ Example: purchase price of a product and the amount of sales tax
paid contain much of the same information.
🞂Irrelevant features
◦ Contain no information that is useful for the data mining task at
hand
◦ Example: students' ID is often irrelevant to the task of predicting
students' GPA
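As a minimal illustration of the purchase-price/sales-tax example (made-up numbers, assuming a flat 8% tax), a near-perfect correlation is one simple signal that a feature is redundant:

# A redundant feature duplicates the information in another attribute.
import numpy as np

price = np.array([100.0, 250.0, 80.0, 400.0, 150.0])
tax = 0.08 * price                        # carries essentially the same information as price
print(np.corrcoef(price, tax)[0, 1])      # 1.0 -> one of the two attributes can be dropped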
An Architecture for Feature Subset Selection
[Figure: an architecture (flowchart) for feature subset selection.]
Measures of Similarity and Dissimilarity
🞂 Similarity measure
◦ Numerical measure of how alike two data objects are
◦ Is higher for pairs of objects that are more alike
◦ Often falls in the range [0, 1]
◦ 0 → no similarity, 1 → completely similar
🞂 Euclidean Distance
🞂 The Minkowski distance with r = 2 is the Euclidean distance:
d(x, y) = sqrt( Σ_k (x_k − y_k)² ), where the sum is over the n
attributes of objects x and y.
Example: find students who answered questions similarly on a test with
True/False questions.
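A tiny numpy sketch of the Euclidean (r = 2 Minkowski) distance between two made-up numeric objects:

# Euclidean distance between two data objects x and y.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
d = np.sqrt(np.sum((x - y) ** 2))         # sqrt((1-4)^2 + (2-6)^2 + (3-3)^2)
print(d)                                  # 5.0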
Jaccard Coefficient
🞂 Used to handle objects consisting of asymmetric binary attributes.
🞂 Suppose that x and y are data objects that represent two rows (two
transactions) of a transaction matrix.
Cosine Similarity
● Dividing x and y by their lengths normalizes them to have length 1:
cos(x, y) = (x · y) / (||x|| ||y||). This means cosine similarity does
not take the magnitudes (lengths) of the two data objects into account
when computing similarity.
● When magnitude is important, computation using the Euclidean distance
is a better choice.
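The following sketch (two made-up binary transaction rows) computes both the Jaccard coefficient and the cosine similarity so the two measures can be compared:

# Jaccard coefficient and cosine similarity for two binary transaction rows.
import numpy as np

x = np.array([1, 0, 0, 1, 1, 0])
y = np.array([1, 1, 0, 1, 0, 0])
f11 = np.sum((x == 1) & (y == 1))         # items present in both rows
f01 = np.sum((x == 0) & (y == 1))
f10 = np.sum((x == 1) & (y == 0))
jaccard = f11 / (f11 + f01 + f10)         # ignores the 0-0 matches
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(jaccard, cosine)                    # 0.5 and about 0.667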
Extended Jaccard Coefficient (Tanimoto Coefficient)
● Extends the Jaccard coefficient to continuous or count attributes:
EJ(x, y) = (x · y) / (||x||² + ||y||² − x · y).
[Figure: scatter plots of 30 pairs of values, randomly generated from a
normal distribution, illustrating correlations ranging from −1 to 1.]
● If x and y are transformed into x' and y', the cosine similarity can
change, e.g., cos(45°) ≠ cos(10°).
● The correlations of the two pairs of vectors are equal when
cos(x, y) = 0 and cos(x', y') = 0.
Drawback of Correlation
🞂x = (−3, −2, −1, 0, 1, 2, 3)
🞂y = (9, 4, 1, 0, 1, 4, 9)
🞂mean(x) = 0, mean(y) = 4
🞂std(x) = 2.16, std(y) = 3.74
🞂Nonlinear relationships: if corr = 0, there is no linear relationship
between the two sets of values, but there may still be a nonlinear one.
🞂Here y_i = x_i², a perfect (nonlinear) relationship, yet the
correlation between x and y is 0.
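This example can be verified directly with numpy (same numbers as above; np.corrcoef is used here only as a convenient check):

# Correlation only detects linear relationships: here y = x**2 exactly, yet corr(x, y) = 0.
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2                                # (9, 4, 1, 0, 1, 4, 9)
print(x.mean(), y.mean())                 # 0.0 and 4.0
print(x.std(ddof=1), y.std(ddof=1))       # about 2.16 and 3.74
print(np.corrcoef(x, y)[0, 1])            # 0.0 despite the perfect (nonlinear) dependence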
Bregman Divergence
● Used as loss or distortion functions.
● Assume x and y are two points, where y is the original point and x is
a distortion or approximation of it.
● The Bregman divergence measures the distortion or loss that results
when y is approximated by x; the more similar x and y are, the smaller
the loss.
● Also used as dissimilarity functions.
● Formal definition: given a strictly convex function φ (with a few
modest restrictions), the Bregman divergence is
D(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩,
i.e., the difference between φ(x) and the first-order Taylor expansion
of φ around y, evaluated at x.
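A minimal sketch of the definition above, using the convex function φ(z) = ||z||², for which the Bregman divergence reduces to the squared Euclidean distance (the function names here are hypothetical):

# Bregman divergence D(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>.
import numpy as np

def bregman(x, y, phi, grad_phi):
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

phi = lambda z: np.dot(z, z)              # strictly convex: phi(z) = ||z||^2
grad_phi = lambda z: 2 * z
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(bregman(x, y, phi, grad_phi))       # 25.0, which equals ||x - y||^2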
Issues in Proximity Calculation
● Issues related to proximity measures are:
🞂First, the type of proximity measure should fit the type of data.
For many types of dense, continuous data, metric distance
measures such as Euclidean distance are often used.