Data Preprocessing
Pukar Karki
Assistant Professor
[email protected]
Contents
1. Data Types and Attributes
2. Data Pre-processing
3. OLAP & Multidimensional Data Analysis
4. Various Similarity Measures
Data Objects
●
Data sets are made up of data objects.
●
A data object represents an entity.
●
Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
●
Also called samples, examples, instances, data points, objects, or tuples.
●
Data objects are described by attributes.
●
Database rows -> data objects; columns -> attributes.
4
Attributes
●
Attribute (or dimensions, features, variables): a data field, representing a
characteristic or feature of a data object.
– E.g., customer_ID, name, address
●
Types:
– Nominal
– Binary
– Ordinal
– Numeric (quantitative):
- Interval-scaled
- Ratio-scaled
5
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between successive
values is not known.
Size = {small, medium, large}, grades, army rankings
6
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of values as being an order of magnitude larger
than the unit of measurement (e.g., 10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length, counts, monetary
quantities
7
Discrete vs. Continuous Attributes
Discrete Attribute
●
Has only a finite or countably infinite set of values
– E.g., zip codes, profession, or the set of words in a collection of documents
●
Sometimes, represented as integer variables
●
Note: Binary attributes are a special case of discrete attributes
Continuous Attribute
●
Has real numbers as attribute values
– E.g., temperature, height, or weight
●
Practically, real values can only be measured and represented using a finite
number of digits
●
Continuous attributes are typically represented as floating-point variables
8
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, and Mode
✔
The most common and effective numeric measure of the “center” of a set of
data is the (arithmetic) mean.
✔
Let x1,x2,...,xN be a set of N values or observations, such as for some numeric
attribute X, like salary.
✔
The mean of this set of values is
    x̄ = (x1 + x2 + … + xN) / N
9
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, and Mode
✔ Sometimes, each value xi in a set may be associated with a weight wi for i = 1, ..., N.
✔
The weights reflect the significance, importance, or occurrence frequency attached to
their respective values.
✔
In this case, we can compute
    x̄ = (w1x1 + w2x2 + … + wNxN) / (w1 + w2 + … + wN)
✔
This is called the weighted arithmetic mean or the weighted average.
10
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, and Mode
✔
Although the mean is the single most useful quantity for describing a data set, it is not always the
best way of measuring the center of the data.
✔
A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
✔
Even a small number of extreme values can corrupt the mean.
✔
For skewed (asymmetric) data, a better measure of the center of data is the median.
11
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, and Mode
✔
The mode is another measure of central tendency.
✔
The mode for a set of data is the value that occurs most frequently in the set and can be
determined for qualitative and quantitative attributes.
✔
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal.
✔
In general, a data set with two or more modes is multimodal.
✔
At the other extreme, if each data value occurs only once, then there is no mode.
12
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, and Mode
✔
For unimodal numeric data that are moderately skewed (asymmetrical), we have the following
empirical relation:
    mean − mode ≈ 3 × (mean − median)
13
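A minimal Python sketch of these three measures of central tendency; the salary values (in thousands) are illustrative and not taken from a table reproduced in these notes:

from statistics import mean, median, multimode

# Illustrative salary values (in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Arithmetic mean: sum of the values divided by their count.
print("mean  :", mean(salaries))          # 58.0

# Weighted arithmetic mean: each value x_i carries a weight w_i (weights are assumed here).
weights = [1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1]
weighted_mean = sum(w * x for w, x in zip(weights, salaries)) / sum(weights)
print("wmean :", round(weighted_mean, 2))

# Median: middle value of the sorted data (robust to the outlier 110).
print("median:", median(salaries))        # 54.0

# Mode(s): most frequent value(s); a data set may be multimodal.
print("mode  :", multimode(salaries))     # [52, 70]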
Basic Statistical Descriptions of Data
Measuring the Dispersion of Data: Quartiles
A plot of the data distribution for some attribute X. The quantiles plotted are quartiles.
The three quartiles divide the distribution into four equal-size consecutive subsets. The
second quartile corresponds to the median.
14
Basic Statistical Descriptions of Data
Five-Number Summary, Boxplots, and Outliers
✔ The five-number summary of a distribution consists of the median (Q2), the quartiles Q1
and Q3, and the smallest and largest individual observations, written in the order of
Minimum, Q1, Median, Q3, Maximum.
✔
The ends of the box are at the quartiles so that the box
length is the interquartile range.
✔
The median is marked by a line within the box.
✔
Two lines (called whiskers) outside the box extend to the
smallest (Minimum) and largest (Maximum)
observations.
15
Basic Statistical Descriptions of Data
Measuring the Dispersion of Data: Variance, Standard Deviation
✔
Variance and standard deviation are measures of data dispersion.
✔
They indicate how spread out a data distribution is.
✔
A low standard deviation means that the data observations tend to be very close to the
mean, while a high standard deviation indicates that the data are spread out over a large
range of values.
16
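A small Python sketch of these dispersion measures (quartiles, five-number summary, interquartile range, variance, standard deviation), again on illustrative salary values:

import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

# Quartiles: Q1, Q2 (median), and Q3 split the sorted data into four equal parts.
q1, q2, q3 = np.percentile(salaries, [25, 50, 75])
iqr = q3 - q1                                   # interquartile range = box length in a boxplot

# Five-number summary: Minimum, Q1, Median, Q3, Maximum.
print("five-number summary:", (salaries.min(), q1, q2, q3, salaries.max()))
print("IQR:", iqr)

# Variance is the mean squared deviation from the mean;
# the standard deviation is its square root.
print("variance:", round(salaries.var(), 2))    # population variance (divide by N)
print("std dev :", round(salaries.std(), 2))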
Contents
1. Data Types and Attributes
2. Data Pre-processing
3. OLAP & Multidimensional Data Analysis
4. Various Similarity Measures
Data Quality: Why Preprocess the Data?
Measures for data quality: A multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some modified but some not, dangling, …
Timeliness: timely update?
Believability: how much the data are trusted to be correct
Interpretability: how easily the data can be understood?
18
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
19
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty instruments,
human or computer error, transmission errors
incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
20
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data were not registered
Missing data may need to be inferred
21
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
Fill in the missing value manually: tedious + infeasible?
Fill in it automatically with
a global constant : e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based such as Bayesian formula or
decision tree
22
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
23
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin medians, or smooth by bin
boundaries
24
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to
detect errors and make corrections
Data auditing: by analyzing data to discover rules and relationship to detect
violators (e.g., correlation and clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations
through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter’s Wheel)
26
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
27
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple databases
Object identification: The same attribute or object may have different
names in different databases
Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
Redundant attributes may be able to be detected by correlation analysis and
covariance analysis
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
28
Correlation Analysis (Nominal Data)
●
Χ2 (chi-square) test:
    χ² = Σ (observed − expected)² / expected
●
The larger the Χ2 value, the more likely the variables are related
●
The cells that contribute the most to the Χ2 value are those whose actual
count is very different from the expected count
●
Correlation does not imply causality
– # of hospitals and # of car thefts in a city are correlated
– Both are causally linked to a third variable: population
29
Q) Correlation analysis of nominal attributes using χ2. Suppose that a
group of 1500 people was surveyed. The gender of each person was noted.
Each person was polled as to whether his or her preferred type of reading
material was fiction or nonfiction. Thus, we have two attributes, gender
and preferred reading. The observed frequency (or count) of each possible
joint event is summarized in the contingency table shown below where the
numbers in parentheses are the expected frequencies.
✔
The expected frequencies are calculated based on the data distribution for both
attributes using
    eij = (count(A = ai) × count(B = bj)) / n
where n is the total number of data tuples.
✔
For example, the expected frequency for
the cell (male, fiction) is
30
✔
For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1.
✔
For 1 degree of freedom, the χ2 value needed to reject the hypothesis at the 0.001 significance level is
10.828.
✔
Since our computed value is above this, we can reject the hypothesis that gender and preferred reading
are independent and conclude that the two attributes are (strongly) correlated for the given group of
people.
31
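A minimal Python sketch of the χ2 computation for a 2 × 2 contingency table. The observed counts below are illustrative stand-ins in the spirit of the gender vs. preferred-reading survey; the slide's actual table is not reproduced in these notes:

# Rows: fiction, nonfiction; columns: male, female (illustrative counts).
observed = [[250, 200],
            [50, 1000]]

n = sum(sum(row) for row in observed)                 # total number of people surveyed
row_totals = [sum(row) for row in observed]           # totals per reading type
col_totals = [sum(col) for col in zip(*observed)]     # totals per gender

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected frequency e_ij = count(row i) * count(col j) / n
        exp = row_totals[i] * col_totals[j] / n
        chi2 += (obs - exp) ** 2 / exp

print("chi-square:", round(chi2, 2))   # ~507.93 for these counts, far above 10.828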
Correlation Analysis (Numeric Data)
●
Correlation coefficient (also called Pearson’s product moment coefficient):
    rA,B = Σ (ai − Ā)(bi − B̄) / (n σA σB) = (Σ aibi − n Ā B̄) / (n σA σB)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and
σB are the respective standard deviations of A and B, and Σ aibi is the sum of the AB
cross-product.
● If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do); the
higher the value, the stronger the correlation.
● rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated
32
Covariance (Numeric Data)
The covariance between A and B is defined as
    Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σ (ai − Ā)(bi − B̄)
If we compare rA,B (correlation coefficient) with covariance, we see that
    rA,B = Cov(A, B) / (σA σB)
It can also be shown that
    Cov(A, B) = E(A · B) − Ā B̄
33
Covariance (Numeric Data)
● Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
expected values.
● Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is
likely to be smaller than its expected value.
●
Independence: CovA,B = 0 but the converse is not true:
– Some pairs of random variables may have a covariance of 0 but are not independent.
34
Co-Variance: An Example
Suppose two stocks A and B have the following values in one
week:
Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
●
Therefore, given the positive covariance, we can say that the
stock prices for both companies tend to rise together.
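A small Python sketch computing the covariance and correlation coefficient for two price series. The weekly stock prices below are illustrative, since the slide's table is not reproduced in these notes:

# Illustrative weekly prices for stocks A and B.
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]

n = len(A)
mean_a = sum(A) / n
mean_b = sum(B) / n

# Cov(A, B) = E[(A - mean_A)(B - mean_B)]
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / n
print("covariance :", cov)              # 4.0 > 0 -> prices tend to rise together

# r_{A,B} = Cov(A, B) / (sigma_A * sigma_B)
std_a = (sum((a - mean_a) ** 2 for a in A) / n) ** 0.5
std_b = (sum((b - mean_b) ** 2 for b in B) / n) ** 0.5
print("correlation:", round(cov / (std_a * std_b), 3))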
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical results
Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Principal Components Analysis (PCA)
Wavelet transforms
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
36
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier analysis, become less
meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
37
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting in dimensionality
reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors
define the new space.
38
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal
components) that can be best used to represent data
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing “significance” or strength
Since the components are sorted, the size of the data can be reduced by eliminating
the weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
39
Principal Component Analysis (Steps)
Step 1 - Data normalization
The goal of this step is to standardize the range of the attributes so that each one contributes equally to the analysis (e.g., by centering each attribute on its mean and scaling by its standard deviation).
40
Principal Component Analysis (Steps)
Step 2 - Covariance matrix computation
As the name suggests, this step is about computing the covariance matrix of the normalized data.
41
Principal Component Analysis (Steps)
Step 3 - Eigenvectors and eigenvalues
Geometrically, an eigenvector represents a direction such as “vertical”
or “90 degrees”.
An eigenvalue, on the other hand, is a number representing the amount of variance in the data along the direction of its eigenvector.
42
Principal Component Analysis (Steps)
Step 4 - Selection of principal components
There are as many eigenvector–eigenvalue pairs as there are dimensions in the data.
Not all the pairs are relevant: the pairs are ranked by eigenvalue, and only the
top-ranked components (those capturing most of the variance) are kept.
43
Principal Component Analysis (Steps)
Step 5 - Data transformation in new dimensional space
This step involves re-orienting the original data onto the new subspace defined by
the selected principal components. This does not change the original data itself but
instead provides a new perspective to better represent the data.
44
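A minimal NumPy sketch of the five PCA steps above, applied to a small illustrative 2-D data set:

import numpy as np

# Illustrative 2-D data set (8 objects, 2 attributes).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# Step 1: normalize (here: center each attribute on its mean).
Xc = X - X.mean(axis=0)

# Step 2: covariance matrix of the normalized data.
C = np.cov(Xc, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)        # eigh: symmetric matrix, ascending order

# Step 4: rank components by eigenvalue and keep the strongest k.
order = np.argsort(eigvals)[::-1]
k = 1
components = eigvecs[:, order[:k]]
print("proportion of variance explained:", np.round(eigvals[order] / eigvals.sum(), 3))

# Step 5: project the centered data onto the new k-dimensional subspace.
Z = Xc @ components
print("reduced data shape:", Z.shape)       # (8, 1)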
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in one or more other
attributes
E.g., purchase price of a product and the amount of sales tax paid
Irrelevant attributes
Contain no information that is useful for the data mining task at hand
E.g., students' ID is often irrelevant to the task of predicting students'
GPA
45
Attribute Creation (Feature Generation)
Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
Three general methodologies
Attribute extraction
Domain-specific
Mapping data to new space (see: data reduction)
E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
Attribute construction
Combining features
Data discretization
46
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data
representation
Parametric methods (e.g., regression)
Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
Ex.: Log-linear models—obtain the value at a point in m-D space as the
product of appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling, …
47
Parametric Data Reduction: Regression
and Log-Linear Models
Linear regression
Data modeled to fit a straight line
Multiple regression
Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector (i.e., of two or more predictor variables)
48
Regression Analysis
Regression analysis: A collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also
called response variable or measurement) and of one or more independent
variables (aka. explanatory variables or predictors).
49
Regression Analysis
The parameters are estimated
so as to give a "best fit" of
the data.
Used for prediction (including
forecasting of time-series
data), inference, hypothesis
testing, and modeling of
causal relationships.
Y = 0.6951*X + 0.2993
51
Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be estimated by
using the data at hand
Using the least-squares criterion on the known values Y1, Y2, …, and X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional space for a set
of discretized attributes, based on a smaller subset of dimensional combinations
Useful for dimensionality reduction and data smoothing
52
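A minimal Python sketch of parametric data reduction by linear regression: the (X, Y) points below are illustrative, and only the two fitted coefficients w and b would need to be stored instead of the raw data:

import numpy as np

# Illustrative data points to be reduced to a fitted line Y = w*X + b.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([1.1, 1.6, 2.4, 3.1, 3.7, 4.5])

# Closed-form least-squares estimates of the two regression coefficients.
w = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b = Y.mean() - w * X.mean()
print(f"Y = {w:.4f}*X + {b:.4f}")

# Numerosity reduction: keep only (w, b) and reconstruct approximate Y values from X.
print("predicted Y:", np.round(w * X + b, 2))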
Histogram Analysis
A histogram for an attribute, A,
partitions the data distribution of A into
disjoint subsets, referred to as buckets
or bins.
If each bucket represents only a single
attribute–value/frequency pair, the
buckets are called singleton buckets.
Often, buckets instead represent
continuous ranges for the given
attribute.
53
Histogram Analysis
Often, buckets instead represent continuous ranges for the given attribute.
54
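A small NumPy sketch of equal-width histogram buckets; the price values are illustrative:

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Three buckets of equal width over the observed range; each bucket can then be
# stored as a (range, count) pair instead of the raw values.
counts, edges = np.histogram(prices, bins=3)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"bucket {lo:.0f}-{hi:.0f}: {c} values")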
Clustering
Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is “smeared”.
Can have hierarchical clustering and be stored in multi-dimensional index
tree structures
There are many choices of clustering definitions and clustering algorithms
55
Sampling
Sampling: obtaining a small sample s to represent the whole data set N
Allow a mining algorithm to run in complexity that is potentially sub-linear to
the size of the data
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor performance in the presence
of skew
Develop adaptive sampling methods, e.g., stratified sampling:
Note: Sampling may not reduce database I/Os (page at a time)
56
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
Used in conjunction with skewed data
57
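A minimal Python sketch of the three sampling schemes; the toy data set and the 10% sampling fraction are assumptions for illustration:

import random
from collections import defaultdict

random.seed(42)
data = [{"id": i, "grade": random.choice("ABC")} for i in range(1, 101)]

# Simple random sample WITHOUT replacement (each object picked at most once).
srswor = random.sample(data, k=10)

# Simple random sample WITH replacement (an object may be picked several times).
srswr = [random.choice(data) for _ in range(10)]

# Stratified sample: partition on 'grade' and draw ~10% from every stratum,
# so that skewed strata are still represented proportionally.
strata = defaultdict(list)
for row in data:
    strata[row["grade"]].append(row)
stratified = []
for grade, rows in strata.items():
    stratified.extend(random.sample(rows, max(1, round(0.10 * len(rows)))))

print(len(srswor), len(srswr), len(stratified))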
Sampling: Cluster or Stratified Sampling
58
Data Reduction 3: Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless, but only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed without reconstructing
the whole
Time sequences are not audio: they are typically short and vary slowly with time
Dimensionality and numerosity reduction may also be considered as forms of data
compression
59
Data Compression
[Diagram: original data can be compressed losslessly, or compressed lossily so that only an approximation of the original data can be reconstructed]
60
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of
replacement values s.t. each old value can be identified with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
61
Normalization
Min-max normalization performs a linear transformation on the original data.
Suppose that minA and maxA are the minimum and maximum values of an attribute, A.
Min-max normalization maps a value, vi, of A to vi′ in the range [new_minA, new_maxA] by
computing
    vi′ = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Suppose that the minimum and maximum values for the attribute income are $12,000 and
$98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max
normalization, a value of $73,600 for income is transformed to
    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
62
Normalization
In z-score normalization (or zero-mean normalization), the values for an
attribute, A, are normalized based on the mean (i.e., average) and standard
deviation of A.
A value, vi, of A is normalized to vi′ by computing
    vi′ = (vi − Ā) / σA
where Ā and σA are the mean and standard deviation of A.
Suppose that the mean and standard deviation of the values for the attribute income are
$54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for
income is transformed to
    (73,600 − 54,000) / 16,000 = 1.225
63
Normalization
Normalization by decimal scaling normalizes by moving the decimal
point of values of attribute A.
The number of decimal places moved depends on the maximum absolute
value of A.
A value, vi, of A is normalized to vi′ by computing
    vi′ = vi / 10^j
where j is the smallest integer such that max(|vi′|) < 1.
67
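A small Python sketch applying the three normalization methods to the income example from the slides (v = $73,600 with minA = $12,000, maxA = $98,000, mean = $54,000, σ = $16,000):

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Linear transformation onto [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # Zero-mean normalization based on mean and standard deviation.
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    # Divide by 10^j, where j is the smallest integer with max(|v'|) < 1.
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

v = 73_600
print(round(min_max(v, 12_000, 98_000), 3))    # 0.716
print(round(z_score(v, 54_000, 16_000), 3))    # 1.225
print(round(decimal_scaling(v, 98_000), 3))    # 0.736 (j = 5)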
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
68
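A minimal Python sketch reproducing the equal-frequency binning and smoothing example above:

# Equal-frequency binning and smoothing for the sorted price data above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin medians: every value is replaced by the bin median.
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the closest bin boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]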
Discretization Without Using Class Labels
(Binning vs. Clustering)
71
Concept Hierarchy Generation
for Nominal Data
Specification of a partial/total ordering of attributes explicitly at the schema level by
users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the analysis of the number
of distinct values
E.g., for a set of attributes: {street, city, state, country}
72
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy.
75
Data Cube: A Multidimensional Data Model
Example: AllElectronics wants to keep track of its store’s sales with
respect to the dimensions time, item, and location.
76
Data Cube: A Multidimensional Data Model
A data cube allows data to be
modeled and viewed in multiple
dimensions. It is defined by
dimensions and facts.
In general terms, dimensions are the
perspectives or entities with respect
to which an organization wants to
keep records.
78
Data Cube: A Multidimensional Data Model
Each dimension may have a
table associated with it, called a
dimension table, which further
describes the dimension.
For example, a dimension table
for item may contain the
attributes item name, brand,
and type.
Dimension tables can be
specified by users or experts, or
automatically generated and
adjusted based on data
distributions. 79
Data Cube: A Multidimensional Data Model
A multidimensional data model is
typically organized around a central
theme, such as sales.
This theme is represented by a fact
table. Facts are numeric measures.
Examples of facts for a sales data
warehouse include dollars sold
(sales amount in dollars), units sold
(number of units sold), and amount
budgeted.
80
Data Cube: A Multidimensional Data Model
The cuboid that holds the lowest level of summarization is called the base cuboid.
81
Data Cube: A Multidimensional Data Model
The 0-D cuboid, which holds the highest level of summarization, is
called the apex cuboid.
In our example, this is the total sales, or dollars sold, summarized over
all four dimensions.
The apex cuboid is typically denoted by all.
82
Schemas for Multidimensional Database
The entity-relationship data model is commonly used in the design of
relational databases, where a database schema consists of a set of
entities and the relationships between them.
Such a data model is appropriate for online transaction processing.
A data warehouse, however, requires a concise, subject-oriented
schema that facilitates online data analysis.
83
Schemas for Multidimensional Database
The most popular data model for a data warehouse is a
multidimensional model, which can exist in the form of a star
schema, a snowflake schema, or a fact constellation schema.
84
Schemas for Multidimensional Database-Star Schema
The most common modeling paradigm is the star schema, in which the
data warehouse contains
(1) a large central table (fact table) containing the bulk of the data, with
no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each
dimension.
The schema graph resembles a starburst, with the dimension tables
displayed in a radial pattern around the central fact table.
85
Example of Star Schema
Schemas for Multidimensional Database-Snowflake Schema
The snowflake schema is a variant of the star schema model, where
some dimension tables are normalized, thereby further splitting the
data into additional tables.
The resulting schema graph forms a shape similar to a snowflake.
87
Example of Snowflake Schema
Star Schema Vs. Snowflake Schema
The major difference between the snowflake and star schema models
is that the dimension tables of the snowflake model may be kept in
normalized form to reduce redundancies.
Such a table is easy to maintain and saves storage space.
However, this space savings is negligible in comparison to the typical
magnitude of the fact table.
89
Star Schema Vs. Snowflake Schema
Furthermore, the snowflake structure can reduce the effectiveness of
browsing, since more joins will be needed to execute a query.
Consequently, the system performance may be adversely impacted.
Hence, although the snowflake schema reduces redundancy, it is not
as popular as the star schema in data warehouse design.
90
Schemas for Multidimensional Database- Fact Constellation Schema
Sophisticated applications may require multiple fact tables to share
dimension tables.
This kind of schema can be viewed as a collection of stars, and hence
is called a galaxy schema or a fact constellation.
91
Example of Fact Constellation
Dimensions: The Role of Concept Hierarchies
✔
A concept hierarchy defines a sequence of mappings from a set of low-
level concepts to higher-level, more general concepts.
✔
Consider a concept hierarchy for the dimension location
✔
City values for location include Vancouver, Toronto, New York, and
Chicago.
Data Matrix and Dissimilarity Matrix
●
Data matrix (or object-by-attribute structure):
- This structure stores the n data objects in the form of
a relational table, or n-by-p matrix (n objects × p
attributes).
●
Dissimilarity matrix(or object-by-object structure):
- This structure stores a collection of proximities that
are available for all pairs of n objects.
- It is often represented by an n-by-n table:
112
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

Dissimilarity Matrix (with Euclidean Distance)
       x1      x2      x3      x4
x1     0
x2     3.61    0
x3     2.24    5.10    0
x4     4.24    1.00    5.39    0
113
Proximity Measure for Nominal Attributes
●
Can take 2 or more states, e.g., red, yellow, blue, green (generalization of
a binary attribute)
●
Method 1: Simple matching
    d(i, j) = (p − m) / p
– m: # of matches, p: total # of variables
●
Method 2: Use a large number of binary attributes
– creating a new binary attribute for each of the M nominal states
114
Proximity Measure for Nominal Attributes
Example : Compute the dissimilarity matrix for given data.
✔
Since here we have one nominal attribute, test-1, we set p = 1
✔
d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ.
115
Proximity Measure for Binary Attributes
✔
A binary attribute has only one of two states: 0 and 1, where 0 means that the attribute is
absent, and 1 means that it is present.
✔
To compute the dissimilarity between two binary attributes we can compute a
dissimilarity matrix from the given binary data.
116
Proximity Measure for Binary Attributes
✔
Dissimilarity that is based on symmetric binary attributes is called symmetric binary
dissimilarity. If objects i and j are described by symmetric binary attributes, then the
dissimilarity between i and j is
    d(i, j) = (r + s) / (q + r + s + t)
where q is the number of attributes that equal 1 for both objects, r is the number that
equal 1 for object i but 0 for object j, s is the number that equal 0 for i but 1 for j, and
t is the number that equal 0 for both objects.
117
Proximity Measure for Binary Attributes
✔
For asymmetric binary attributes, the two states are not equally important, such as the
positive (1) and negative (0) outcomes of a disease test.
✔
The dissimilarity based on these attributes is called asymmetric binary dissimilarity,
where the number of negative matches, t, is considered unimportant and is thus ignored
in the following computation:
    d(i, j) = (r + s) / (q + r + s)
118
Proximity Measure for Binary Attributes
✔
Complementarily, we can measure the difference between two binary attributes based
on the notion of similarity instead of dissimilarity. For example, the asymmetric binary
similarity between the objects i and j can be computed as
    sim(i, j) = q / (q + r + s) = 1 − d(i, j)
✔
The coefficient sim(i, j) is called the Jaccard coefficient.
119
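A small Python sketch of the binary proximity measures above; the two binary vectors are illustrative (1 = attribute present, 0 = absent):

i = [1, 0, 1, 0, 1, 0, 0]
j = [1, 0, 0, 0, 1, 1, 0]

q = sum(a == 1 and b == 1 for a, b in zip(i, j))   # both 1
r = sum(a == 1 and b == 0 for a, b in zip(i, j))   # 1 in i, 0 in j
s = sum(a == 0 and b == 1 for a, b in zip(i, j))   # 0 in i, 1 in j
t = sum(a == 0 and b == 0 for a, b in zip(i, j))   # both 0

sym_d   = (r + s) / (q + r + s + t)   # symmetric binary dissimilarity
asym_d  = (r + s) / (q + r + s)       # asymmetric: negative matches t ignored
jaccard = q / (q + r + s)             # Jaccard coefficient = 1 - asym_d

print(q, r, s, t)                                            # 2 1 1 3
print(round(sym_d, 3), round(asym_d, 3), round(jaccard, 3))  # 0.286 0.5 0.5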
Dissimilarity between Binary Variables
●
Example
120
Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
    d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h)^(1/h)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric.
121
Special Cases of Minkowski Distance
● h = 1: Manhattan (city block, L1 norm) distance
    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
– E.g., the Hamming distance: the number of bits that are different between two
binary vectors
● h = 2: Euclidean (L2 norm) distance
    d(i, j) = sqrt(|xi1 − xj1|² + |xi2 − xj2|² + … + |xip − xjp|²)
● h → ∞: supremum (Lmax, L∞ norm) distance: the maximum difference between
any attribute of the two objects
122
Example: Minkowski Distance
Dissimilarity Matrices
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Manhattan (L1)
L1     x1    x2    x3    x4
x1     0
x2     5     0
x3     3     6     0
x4     6     1     7     0

Euclidean (L2)
L2     x1      x2      x3      x4
x1     0
x2     3.61    0
x3     2.24    5.10    0
x4     4.24    1.00    5.39    0

Supremum (L∞)
L∞     x1    x2    x3    x4
x1     0
x2     3     0
x3     2     5     0
x4     3     1     5     0
123
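A minimal NumPy sketch that recomputes the three dissimilarity matrices above from the four data points:

import numpy as np

# The four 2-D points from the example above.
X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]], dtype=float)

def minkowski(a, b, h):
    # L-h norm distance between two points.
    return np.sum(np.abs(a - b) ** h) ** (1.0 / h)

def supremum(a, b):
    # L-infinity (max) norm distance.
    return np.max(np.abs(a - b))

def dissimilarity_matrix(points, dist):
    n = len(points)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            d[i, j] = d[j, i] = dist(points[i], points[j])
    return d

print(dissimilarity_matrix(X, lambda a, b: minkowski(a, b, 1)))               # Manhattan
print(np.round(dissimilarity_matrix(X, lambda a, b: minkowski(a, b, 2)), 2))  # Euclidean
print(dissimilarity_matrix(X, supremum))                                      # Supremum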
Ordinal Variables
✔
The values of an ordinal attribute have a meaningful order or ranking about them, yet
the magnitude between successive values is unknown.
✔
Let Mf represent the number of possible states that an ordinal attribute f can have.
These ordered states define the ranking 1, . . . , Mf.
✔
Suppose that f is an attribute from a set of ordinal attributes describing n objects. The
dissimilarity computation with respect to f involves the following steps:
1. Replace each value for f by its corresponding rank rif ∈ {1, …, Mf}.
2. Normalize each rank onto [0.0, 1.0] by computing zif = (rif − 1) / (Mf − 1), so that
each ordinal attribute carries equal weight.
124
Ordinal Variables
Dissimilarity can then be computed using any of the distance measures for numeric
attributes, using zif to represent the f value for the ith object.
125
Ordinal Variables ✔
Suppose that we have the sample data as shown, except
that this time only the object-identifier and the continuous
Example ordinal attribute, test-2, are available.
✔
There are three states for test-2: fair, good, and excellent,
that is, Mf = 3.
✔
If we replace each value for test-2 by its rank, the four
objects are assigned the ranks 3, 1, 2, and 3, respectively.
✔
We then normalize the ranking by mapping rank 1 to 0.0,
rank 2 to 0.5, and rank 3 to 1.0.
✔
Finally, we can use the Euclidean distance, which results in
the following dissimilarity matrix:
       x1     x2     x3     x4
x1     0
x2     1.0    0
x3     0.5    0.5    0
x4     0      1.0    0.5    0
126
Attributes of Mixed Type
●
A database may contain all attribute types
– Nominal, symmetric binary, asymmetric binary, numeric, ordinal
●
One may use a weighted formula to combine their effects:
    d(i, j) = Σf δij(f) dij(f) / Σf δij(f)
where the indicator δij(f) = 0 if the value of attribute f is missing for object i or j (or if
the attribute is asymmetric binary and both values are 0), and δij(f) = 1 otherwise;
dij(f) is the contribution of attribute f to the dissimilarity between i and j.
127
Attributes of Mixed Type
Example
For test-I For test-2
✔
We can now use the dissimilarity matrices for the three
For test-3 attributes
✔
The indicator δij(f) = 1 for each of the three attributes, f .
✔
For example
✔
The resulting dissimilarity matrix obtained is
128
Cosine Similarity
●
A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
●
Other vector objects: gene features in micro-arrays, …
●
Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
●
Cosine measure:
    sim(x, y) = cos(x, y) = (x · y) / (||x|| ||y||)
●
where ||x|| is the Euclidean norm of vector x = (x1, x2, ..., xp), defined as
sqrt(x1² + x2² + … + xp²), and x · y is the dot product of the two vectors
129
Example: Cosine Similarity
Suppose that x and y are the first two
term-frequency vectors as shown.
That is,
X = (5,0,3,0,2,0,0,2,0,0)
Y = (3,0,2,0,1,1,0,1,0,1).
How similar are x and y?
    x · y = 5×3 + 3×2 + 2×1 + 2×1 = 25
    ||x|| = sqrt(5² + 3² + 2² + 2²) = sqrt(42) ≈ 6.48
    ||y|| = sqrt(3² + 2² + 1² + 1² + 1² + 1²) = sqrt(17) ≈ 4.12
    sim(x, y) = 25 / (6.48 × 4.12) ≈ 0.94
130
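A short NumPy sketch verifying the cosine similarity for the two term-frequency vectors above:

import numpy as np

x = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
y = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)

# cos(x, y) = (x . y) / (||x|| * ||y||)
cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(cos, 2))   # ~0.94 -> the two documents are quite similar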
Simple Matching Coefficient
The simple matching coefficient (SMC) or Rand similarity coefficient is a statistic used for
comparing the similarity and diversity of sample sets.
Given two objects, A and B, each with n binary attributes, SMC is defined as:
    SMC = (M11 + M00) / (M00 + M01 + M10 + M11)
where:
M00 is the total number of attributes where A and B both have a value of 0.
M11 is the total number of attributes where A and B both have a value of 1.
M01 is the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
M10 is the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
The simple matching distance (SMD), which measures dissimilarity between sample sets, is given by 1 − SMC
131
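A small Python sketch computing SMC, SMD, and (for contrast) the Jaccard coefficient on two illustrative binary vectors:

A = [1, 0, 0, 1, 1, 0]
B = [1, 0, 1, 1, 0, 0]

m11 = sum(a == 1 and b == 1 for a, b in zip(A, B))
m00 = sum(a == 0 and b == 0 for a, b in zip(A, B))
m10 = sum(a == 1 and b == 0 for a, b in zip(A, B))
m01 = sum(a == 0 and b == 1 for a, b in zip(A, B))

smc = (m11 + m00) / (m00 + m01 + m10 + m11)   # counts 0-0 matches as agreement
smd = 1 - smc                                  # simple matching distance
jaccard = m11 / (m11 + m01 + m10)              # ignores 0-0 matches

print(round(smc, 3), round(smd, 3), round(jaccard, 3))   # 0.667 0.333 0.5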
Review Question
1) Explain data warehouse architecture with its analytical processing.
2) Suppose that a data warehouse consists of the four dimensions date,
spectator, location, and game, and the two measures count and charge, where
charge is the fare that a spectator pays when watching a game on a given
date. Spectators may be students, adults or seniors, with each category having
its own charge rate.
a) Draw a star schema diagram for the data warehouse.
b) Starting with the base cuboid [date, spectator, location, game], what specific
OLAP operations should you perform in order to list the total charge paid by
student spectators at Dashrath Stadium in 2021?
Review Question
6) Use the following methods to normalize the data: 200, 300, 400, 600
and 1000.
a) Min-max normalization by setting min=0 and max=1.
b) Z-score normalization.
c) Normalization by decimal scaling.
Review Question
7) Find the principal components and the proportion of the total variance
explained by each when the covariance matrix of the three random
variables X1, X2 and X3 is :
Review Question
8) Given the following points compute the distance matrix using the
Manhattan and the supremum distance.
Review Question
8) Given the following two vectors compute the Cosine similarity between
them.
D1 = [ 4 0 2 0 1]
D2 = [ 2 0 0 2 2]
9) Given the following two binary vectors compute the Jaccard similarity
and Simple Matching Coefficient.
P = [ 0 0 1 1 0 1]
Q = [1 1 1 1 0 1]
Review Question
10) Why is data preprocessing necessary? Explain the methods for data
preprocessing to maintain data quality.
11) What is data pre-processing? Explain data sampling and
dimensionality reduction in data pre-processing with suitable example.
12) What are the approaches to handle missing data?
13) What are the measuring elements of data quality? Explain different
normalization methods for data transformation with an example.