
2. Data Preprocessing (6 Hrs)

Pukar Karki
Assistant Professor
[email protected]
Contents
1. Data Types and Attributes
2. Data Pre-processing
3. OLAP & Multidimensional Data Analysis
4. Various Similarity Measures
Contents
1. Data Types and Attributes
2. Data Pre-processing
3. OLAP & Multidimensional Data Analysis
4. Various Similarity Measures
Data Objects

Data sets are made up of data objects.

A data object represents an entity.

Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses

Also called samples, examples, instances, data points, objects, tuples.

Data objects are described by attributes.

Database rows -> data objects; columns -> attributes.
4
Attributes

Attribute (or dimensions, features, variables): a data field, representing a
characteristic or feature of a data object.
– E.g., customer_ID, name, address

Types:
– Nominal
– Binary
– Ordinal
– Numeric (quantitative):
- Interval-scaled
- Ratio-scaled

5
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between successive
values is not known.
 Size = {small, medium, large}, grades, army rankings
6
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in °C or °F, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of values as being an order of magnitude larger
than the unit of measurement (10 K is twice as high as 5 K).
 e.g., temperature in Kelvin, length, counts, monetary
quantities
7
Discrete vs. Continuous Attributes
Discrete Attribute

Has only a finite or countably infinite set of values
– E.g., zip codes, profession, or the set of words in a collection of documents

Sometimes, represented as integer variables

Note: Binary attributes are a special case of discrete attributes
Continuous Attribute

Has real numbers as attribute values
– E.g., temperature, height, or weight

Practically, real values can only be measured and represented using a finite
number of digits

Continuous attributes are typically represented as floating-point variables
8
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, and Mode

The most common and effective numeric measure of the “center” of a set of
data is the (arithmetic) mean.

Let x1,x2,...,xN be a set of N values or observations, such as for some numeric
attribute X, like salary.

The mean of this set of values is
mean = (x1 + x2 + ... + xN) / N = (1/N) Σ xi

9
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, and Mode
✔ Sometimes, each value xi in a set may be associated with a weight wi for i = 1, ..., N.

The weights reflect the significance, importance, or occurrence frequency attached to
their respective values.

In this case, we can compute
weighted mean = (w1 x1 + w2 x2 + ... + wN xN) / (w1 + w2 + ... + wN) = Σ wi xi / Σ wi

This is called the weighted arithmetic mean or the weighted average.
10
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, and Mode

Although the mean is the single most useful quantity for describing a data set, it is not always the
best way of measuring the center of the data.

A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.

Even a small number of extreme values can corrupt the mean.

For skewed (asymmetric) data, a better measure of the center of data is the median.

11
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, and Mode

The mode is another measure of central tendency.

The mode for a set of data is the value that occurs most frequently in the set and can be
determined for qualitative and quantitative attributes.

Data sets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal.

In general, a data set with two or more modes is multimodal.

At the other extreme, if each data value occurs only once, then there is no mode.

12
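As a quick illustration of these measures of central tendency, the following sketch (assuming Python with NumPy and the standard library's statistics module, and an invented salary list) computes the mean, weighted mean, median, and mode(s):

```python
import numpy as np
from statistics import median, multimode

# Hypothetical salary values (in thousands of dollars), for illustration only.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
weights  = [1,  1,  1,  1,  2,  2,  1,  1,  1,  1,  1,  1]  # e.g., occurrence frequencies

print("mean          :", np.mean(salaries))                      # arithmetic mean
print("weighted mean :", np.average(salaries, weights=weights))  # sum(w*x) / sum(w)
print("median        :", median(salaries))                       # middle value, robust to outliers
print("mode(s)       :", multimode(salaries))                    # most frequent value(s): [52, 70] -> bimodal
```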
Basic Statistical Descriptions of Data
Measuring the Central Tendency: Mean, Median, and Mode

For unimodal numeric data that are moderately skewed (asymmetrical), we have the following
empirical relation:
mean − mode ≈ 3 × (mean − median)
13
Basic Statistical Descriptions of Data
Measuring the Dispersion of Data: Quartiles

A plot of the data distribution for some attribute X. The quantiles plotted are quartiles.
The three quartiles divide the distribution into four equal-size consecutive subsets. The
second quartile corresponds to the median.

14
Basic Statistical Descriptions of Data
Five-Number Summary, Boxplots, and Outliers
✔ The five-number summary of a distribution consists of the median (Q2), the quartiles Q1
and Q3, and the smallest and largest individual observations, written in the order of
Minimum, Q1, Median, Q3, Maximum.

The ends of the box are at the quartiles so that the box
length is the interquartile range.

The median is marked by a line within the box.

Two lines (called whiskers) outside the box extend to the
smallest (Minimum) and largest (Maximum)
observations.

15
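A minimal sketch of the five-number summary and boxplot quantities using NumPy percentiles (the observations below are invented for illustration):

```python
import numpy as np

# Invented observations for illustration.
data = np.array([6, 7, 12, 13, 15, 18, 20, 21, 22, 25, 30, 47])

q1, med, q3 = np.percentile(data, [25, 50, 75])
five_number_summary = (data.min(), q1, med, q3, data.max())
iqr = q3 - q1                      # interquartile range = box length in a boxplot

print("Minimum, Q1, Median, Q3, Maximum:", five_number_summary)
print("IQR:", iqr)

# A common rule flags values beyond 1.5 * IQR from the quartiles as potential outliers.
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("Potential outliers:", outliers)
```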
Basic Statistical Descriptions of Data
Measuring the Dispersion of Data: Variance, Standard Deviation


Variance and standard deviation are measures of data dispersion.

They indicate how spread out a data distribution is.

A low standard deviation means that the data observations tend to be very close to the
mean, while a high standard deviation indicates that the data are spread out over a large
range of values.

16
Contents
1. Data Types and Attributes
2. Data Pre-processing
3. OLAP & Multidimensional Data Analysis
4. Various Similarity Measures
Data Quality: Why Preprocess the Data?
 Measures for data quality: A multidimensional view
 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update?
 Believability: how much the data are trusted to be correct?
 Interpretability: how easily the data can be understood?

18
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
19
Data Cleaning
 Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty,
human or computer error, transmission error
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
 e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
20
Incomplete (Missing) Data
 Data is not always available
 E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of entry
 history or changes of the data were not registered
 Missing data may need to be inferred

21
How to Handle Missing Data?
 Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill in it automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based such as Bayesian formula or
decision tree
22
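A small sketch of these automatic fill-in strategies with pandas (the DataFrame and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [50_000, None, 42_000, None, 48_000],
})

df["global_const"] = df["income"].fillna("unknown")            # global constant (new "class")
df["overall_mean"] = df["income"].fillna(df["income"].mean())  # attribute mean
df["class_mean"]   = df["income"].fillna(                      # smarter: mean within the same class
    df.groupby("cls")["income"].transform("mean"))

print(df)
```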
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
23
How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median, smooth by

bin boundaries, etc.


 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with possible

outliers)
24
25
Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to
detect errors and make corrections
 Data auditing: by analyzing data to discover rules and relationship to detect
violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations
through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter's Wheel)
26
Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources are different
 Possible reasons: different representations, different scales, e.g., metric vs. British units

27
Handling Redundancy in Data Integration
 Redundant data occur often when integration of multiple databases
 Object identification: The same attribute or object may have different
names in different databases
 Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
 Redundant attributes may be able to be detected by correlation analysis and
covariance analysis
 Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

28
Correlation Analysis (Nominal Data)

χ2 (chi-square) test
χ2 = Σ (oij − eij)2 / eij, summed over all cells of the contingency table,
where oij is the observed frequency (actual count) of the joint event (Ai, Bj) and eij is the
expected frequency, eij = count(A = ai) × count(B = bj) / n.
The larger the Χ2 value, the more likely the variables are related

The cells that contribute the most to the Χ2 value are those whose actual
count is very different from the expected count

Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population 29
Q) Correlation analysis of nominal attributes using χ2. Suppose that a
group of 1500 people was surveyed. The gender of each person was noted.
Each person was polled as to whether his or her preferred type of reading
material was fiction or nonfiction. Thus, we have two attributes, gender
and preferred reading. The observed frequency (or count) of each possible
joint event is summarized in the contingency table shown below where the
numbers in parentheses are the expected frequencies.

The expected frequencies are calculated
based on the data distribution for both
attributes using
eij = count(A = ai) × count(B = bj) / n, where n is the number of data tuples.

For example, the expected frequency for
the cell (male, fiction) is

30

For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1.

For 1 degree of freedom, the χ2 value needed to reject the hypothesis at the 0.001 significance level is
10.828.

Since our computed value is above this, we can reject the hypothesis that gender and preferred reading
are independent and conclude that the two attributes are (strongly) correlated for the given group of
people.
31
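The test is a one-liner with SciPy. The slide's contingency table is not reproduced in this text, so the observed counts below are assumed purely for illustration:

```python
from scipy.stats import chi2_contingency

# Observed counts (rows: male/female, columns: fiction/non-fiction).
# These counts are hypothetical; substitute the actual contingency table.
observed = [[250, 200],
            [50, 1000]]

# correction=False matches the plain chi-square formula on the slide (no Yates correction).
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print("chi-square        :", round(chi2, 2))   # large value -> attributes likely correlated
print("p-value           :", p_value)
print("degrees of freedom:", dof)              # (2 - 1) * (2 - 1) = 1
print("expected counts   :\n", expected)       # e.g., expected(male, fiction) = 450 * 300 / 1500 = 90
```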
Correlation Analysis (Numeric Data)

Correlation coefficient (also called Pearson's product moment coefficient)
rA,B = Σ (ai − Ā)(bi − B̄) / (n σA σB) = (Σ(ai bi) − n Ā B̄) / (n σA σB)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and
σB are the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB
cross-product.
● If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The
higher, the stronger correlation.
● rA,B = 0: independent; rAB < 0: negatively correlated

32
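A short sketch with NumPy, using two invented numeric attributes, showing that the built-in Pearson coefficient matches the definition above:

```python
import numpy as np

# Two hypothetical numeric attributes, for illustration only.
A = np.array([2, 4, 6, 8, 10], dtype=float)
B = np.array([1, 3, 5, 9, 12], dtype=float)

# Pearson's product moment coefficient.
r = np.corrcoef(A, B)[0, 1]
print("r_A,B  =", round(r, 3))   # > 0: positively correlated, < 0: negatively, ~0: uncorrelated

# Same value from the definition: sum((a - mean_A)(b - mean_B)) / (n * sigma_A * sigma_B)
n = len(A)
r_manual = ((A - A.mean()) * (B - B.mean())).sum() / (n * A.std() * B.std())
print("manual =", round(r_manual, 3))
```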
Covariance (Numeric Data)

The covariance between A and B is defined as
Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σ (ai − Ā)(bi − B̄)

If we compare rA,B (correlation coefficient) with covariance, we see that
rA,B = Cov(A, B) / (σA σB)
where σA and σB are the standard deviations of A and B, respectively.

It can also be shown that
Cov(A, B) = E(A · B) − Ā B̄
33
Covariance (Numeric Data)
● Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
expected values.
● Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is
likely to be smaller than its expected value.

Independence: CovA,B = 0 but the converse is not true:
– Some pairs of random variables may have a covariance of 0 but are not independent.

34
Co-Variance: An Example

Suppose two stocks A and B have the following values in one
week:

Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?


Therefore, given the positive covariance we can say that
stock prices for both companies rise together.
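The table of weekly prices is not reproduced in this text, so the sketch below uses hypothetical prices for stocks A and B (NumPy assumed) to show how the covariance would be computed:

```python
import numpy as np

# Hypothetical closing prices for stocks A and B over one week (illustrative only).
A = np.array([6, 5, 4, 3, 2], dtype=float)
B = np.array([20, 10, 14, 5, 5], dtype=float)

# Cov(A, B) = E(A*B) - mean(A)*mean(B)
cov_manual = (A * B).mean() - A.mean() * B.mean()
print("Cov(A, B) =", cov_manual)            # positive -> the two prices tend to rise together

# np.cov uses the sample (n-1) denominator by default; bias=True gives the population version above.
print("np.cov    =", np.cov(A, B, bias=True)[0, 1])
```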
Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical results
 Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes
 Principal Components Analysis (PCA)
 Wavelet transforms
 Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)
 Regression and Log-Linear Models
 Histograms, clustering, sampling
 Data cube aggregation
 Data compression
36
Data Reduction 1: Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering, outlier analysis, becomes less
meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)
37
Principal Component Analysis (PCA)
 Find a projection that captures the largest amount of variation in data
 The original data are projected onto a much smaller space, resulting in dimensionality
reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors
define the new space.
(Figure: data points plotted in the x1–x2 plane, with the principal component directions overlaid.)
38
Principal Component Analysis (Steps)
 Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal
components) that can be best used to represent data
 Normalize input data: Each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the k principal component vectors
 The principal components are sorted in order of decreasing “significance” or strength
 Since the components are sorted, the size of the data can be reduced by eliminating
the weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
 Works for numeric data only

39
Principal Component Analysis (Steps)
Step 1 - Data normalization
 By considering the example in the introduction, let’s consider, for

instance, the following information for a given client.


Monthly expenses: $300 Age: 27 Rating: 4.5
 This information has different scales and performing PCA using such

data will lead to a biased result.


 This is where data normalization comes in. It ensures that each

attribute has the same level of contribution, preventing one variable


from dominating others.

40
Principal Component Analysis (Steps)
Step 2 - Covariance matrix computation
As the name suggests, this step is about computing the covariance

matrix from the normalized data.


 This is a symmetric matrix, and each element (i, j) corresponds to the

covariance between variables i and j.

41
Principal Component Analysis (Steps)
Step 3 - Eigenvectors and eigenvalues
 Geometrically, an eigenvector represents a direction such as “vertical”

or “90 degrees”.
 An eigenvalue, on the other hand, is a number representing the

amount of variance present in the data for a given direction.


 Each eigenvector has its corresponding eigenvalue.

42
Principal Component Analysis (Steps)
Step 4 - Selection of principal components
 There are as many pairs of eigenvectors and eigenvalues as the

number of variables in the data.


In the data with only monthly expenses, age, and rating, there will be

three pairs.
 Not all the pairs are relevant.

 So, the eigenvector with the highest eigenvalue corresponds to the

first principal component.


 The second principal component is the eigenvector with the second

highest eigenvalue, and so on.

43
Principal Component Analysis (Steps)
Step 5 - Data transformation in new dimensional space
 This step involves re-orienting the original data onto a new subspace

defined by the principal components.


 This reorientation is done by multiplying the original data by the

previously computed eigenvectors.


 It is important to remember that this transformation does not modify

the original data itself but instead provides a new perspective to better
represent the data.

44
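The five steps above map directly onto a few lines of scikit-learn. This is only a sketch; the small client table with the three attributes mentioned earlier (monthly expenses, age, rating) is invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical client records: [monthly_expenses, age, rating] (illustration only).
X = np.array([[300, 27, 4.5],
              [450, 35, 3.9],
              [120, 22, 4.8],
              [800, 41, 3.2],
              [500, 30, 4.1]])

X_std = StandardScaler().fit_transform(X)   # Step 1: normalization (zero mean, unit variance)

pca = PCA(n_components=2)                   # Steps 2-4: covariance matrix, eigen-decomposition,
X_new = pca.fit_transform(X_std)            #            keep the top-k components; Step 5: project

print("explained variance ratio:", pca.explained_variance_ratio_)
print("principal components (eigenvectors):\n", pca.components_)
print("data in the new 2-D space:\n", X_new)
```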
Attribute Subset Selection
 Another way to reduce dimensionality of data
 Redundant attributes
 Duplicate much or all of the information contained in one or more other
attributes
 E.g., purchase price of a product and the amount of sales tax paid
 Irrelevant attributes
 Contain no information that is useful for the data mining task at hand
 E.g., students' ID is often irrelevant to the task of predicting students'
GPA

45
Attribute Creation (Feature Generation)
 Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
 Three general methodologies
 Attribute extraction

 Domain-specific
 Mapping data to new space (see: data reduction)
 E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
 Attribute construction
 Combining features
 Data discretization

46
Data Reduction 2: Numerosity Reduction
 Reduce data volume by choosing alternative, smaller forms of data
representation
 Parametric methods (e.g., regression)
 Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
 Ex.: Log-linear models—obtain value at a point in m-D space as the
product on appropriate marginal subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …

47
Parametric Data Reduction: Regression
and Log-Linear Models
 Linear regression
 Data modeled to fit a straight line

 Often uses the least-square method to fit the line

 Multiple regression
 Allows a response variable Y to be modeled as a linear function of

multidimensional feature vector


 Log-linear model
 Approximates discrete multidimensional probability distributions

48
Regression Analysis
 Regression analysis: A collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also
called response variable or measurement) and of one or more independent
variables (aka. explanatory variables or predictors).

49
50
Regression Analysis
 The parameters are estimated
so as to give a "best fit" of
the data.
 Used for prediction (including
forecasting of time-series
data), inference, hypothesis
testing, and modeling of
causal relationships.

Y = 0.6951*X + 0.2993
51
Regression Analysis and Log-Linear Models
 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to be estimated by
using the data at hand
 Using the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
 Log-linear models:
 Approximate discrete multidimensional probability distributions
 Estimate the probability of each point (tuple) in a multi-dimensional space for a set
of discretized attributes, based on a smaller subset of dimensional combinations
 Useful for dimensionality reduction and data smoothing
52
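A minimal sketch of regression as data reduction: fit once, then keep only the coefficients (w, b) instead of the raw points. The data below is synthetic, and NumPy's least-squares helpers are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data roughly following Y = 0.7 * X + 0.3 plus noise (illustration only).
X = rng.uniform(0, 10, size=50)
Y = 0.7 * X + 0.3 + rng.normal(scale=0.5, size=50)

# Linear regression Y = w*X + b via least squares: store only (w, b) as the reduced representation.
w, b = np.polyfit(X, Y, deg=1)
print(f"Y = {w:.4f}*X + {b:.4f}")

# Multiple regression Y = b0 + b1*X1 + b2*X2, solved with lstsq on the design matrix.
X1, X2 = X, rng.uniform(0, 5, size=50)
design = np.column_stack([np.ones_like(X1), X1, X2])
coeffs, *_ = np.linalg.lstsq(design, Y, rcond=None)
print("b0, b1, b2 =", np.round(coeffs, 4))
```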
Histogram Analysis
 A histogram for an attribute, A,
partitions the data distribution of A into
disjoint subsets, referred to as buckets
or bins.
 If each bucket represents only a single
attribute–value/frequency pair, the
buckets are called singleton buckets.
 Often, buckets instead represent
continuous ranges for the given
attribute.

53
Histogram Analysis
 Often, buckets instead represent continuous ranges for the given attribute.

54
Clustering
 Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
 Can be very effective if data is clustered but not if data is “smeared”.
 Can have hierarchical clustering and be stored in multi-dimensional index
tree structures
 There are many choices of clustering definitions and clustering algorithms

55
Sampling
 Sampling: obtaining a small sample s to represent the whole data set N
 Allow a mining algorithm to run in complexity that is potentially sub-linear to
the size of the data
 Key principle: Choose a representative subset of the data
 Simple random sampling may have very poor performance in the presence
of skew
 Develop adaptive sampling methods, e.g., stratified sampling:
 Note: Sampling may not reduce database I/Os (page at a time)

56
Types of Sampling
 Simple random sampling
 There is an equal probability of selecting any particular item
 Sampling without replacement
 Once an object is selected, it is removed from the population
 Sampling with replacement
 A selected object is not removed from the population
 Stratified sampling:
 Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
 Used in conjunction with skewed data

57
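A short sketch of these sampling variants with pandas (the DataFrame is hypothetical; the stratified draw takes the same fraction from each partition so the small "high" stratum is not missed):

```python
import pandas as pd

# Hypothetical, skewed data set: many 'low' spenders, few 'high' spenders.
df = pd.DataFrame({
    "segment": ["low"] * 90 + ["high"] * 10,
    "spend":   list(range(90)) + list(range(900, 910)),
})

srswor = df.sample(n=10, replace=False, random_state=1)   # simple random sample without replacement
srswr  = df.sample(n=10, replace=True,  random_state=1)   # simple random sample with replacement

# Stratified sample: draw ~10% from each segment.
stratified = df.groupby("segment", group_keys=False).sample(frac=0.1, random_state=1)

print(srswor["segment"].value_counts(), "\n")
print(stratified["segment"].value_counts())
```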
Sampling: Cluster or Stratified Sampling

(Figure: raw data compared with a cluster/stratified sample.)

58
Data Reduction 3: Data Compression
 String compression
 There are extensive theories and well-tuned algorithms
 Typically lossless, but only limited manipulation is possible without expansion
 Audio/video compression
 Typically lossy compression, with progressive refinement
 Sometimes small fragments of signal can be reconstructed without reconstructing
the whole
 Time sequence data is not audio
 It is typically short and varies slowly with time
 Dimensionality and numerosity reduction may also be considered as forms of data
compression

59
Data Compression

(Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data.)

60
Data Transformation
 A function that maps the entire set of values of a given attribute to a new set of
replacement values s.t. each old value can be identified with one of the new values
 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Discretization: Concept hierarchy climbing

61
Normalization

Min-max normalization performs a linear transformation on the original data.

Suppose that minA and maxA are the minimum and maximum values of an attribute, A.

Min-max normalization maps a value, vi, of A to vi′ in the range [new_minA,new_maxA] by
computing
vi′ = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

Suppose that the minimum and maximum values for the attribute income are $12,000 and
$98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max
normalization, a value of $73,600 for income is transformed to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 = 0.716
62
Normalization

In z-score normalization (or zero-mean normalization), the values for an
attribute, A, are normalized based on the mean (i.e., average) and standard
deviation of A.

A value, vi, of A is normalized to vi′ by computing
vi′ = (vi − Ā) / σA

 Suppose that the mean and standard deviation of the values for the attribute income are
$54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for
income is transformed to
(73,600 − 54,000) / 16,000 = 1.225
63
Normalization

Normalization by decimal scaling normalizes by moving the decimal
point of values of attribute A.

The number of decimal points moved depends on the maximum absolute
value of A.

A value, vi, of A is normalized to vi′ by computing
vi′ = vi / 10^j
where j is the smallest integer such that max(|vi′|) < 1



Suppose that the recorded values of A range from −986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we
therefore divide each value by 1000 (i.e., j = 3) so that −986 normalizes to
−0.986 and 917 normalizes to 0.917.
64
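The three normalization methods take only a few lines of Python. The sketch below re-uses the income figures from the slides ($12,000–$98,000 range, mean $54,000, standard deviation $16,000, value $73,600) and the −986..917 decimal-scaling example:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: linear map of [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score (zero-mean) normalization."""
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    """Divide by 10^j, where j is the smallest integer making max(|v'|) < 1 (assumes max_abs >= 1)."""
    j = len(str(int(max_abs)))          # e.g., 986 -> j = 3
    return v / 10 ** j

print(min_max(73_600, 12_000, 98_000))                         # 0.716
print(z_score(73_600, 54_000, 16_000))                         # 1.225
print(decimal_scaling(-986, 986), decimal_scaling(917, 986))   # -0.986, 0.917
```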
Discretization
 Three types of attributes
 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic rank
 Numeric—real numbers, e.g., integer or real numbers
 Discretization: Divide the range of a continuous attribute into intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification
65
Data Discretization Methods
 Typical methods: All the methods can be applied recursively
 Binning
 Top-down split, unsupervised
 Histogram analysis
 Top-down split, unsupervised
 Clustering analysis (unsupervised, top-down split or bottom-up
merge)
 Decision-tree analysis (supervised, top-down split)
 Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)
66
Simple Discretization: Binning
 Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the width of intervals will be:
W = (B –A)/N.
 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately same number of samples
 Good data scaling
 Managing categorical attributes can be tricky

67
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
68
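A small sketch reproducing the smoothing above in plain Python: equal-frequency partitioning, then smoothing by bin means and by bin boundaries.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

# Equal-frequency (equi-depth) partitioning into 3 bins of 4 values each.
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed_by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer of the bin's min/max.
smoothed_by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
]

print(bins)                     # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed_by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smoothed_by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```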
Discretization Without Using Class Labels
(Binning vs. Clustering)

(Figure: the same data discretized by equal interval width binning, equal frequency binning, and K-means clustering; K-means clustering leads to better results.)


69
Discretization by Classification & Correlation
Analysis
 Classification (e.g., decision tree analysis)
 Supervised: Given class labels, e.g., cancerous vs. benign
 Using entropy to determine split point (discretization point)
 Top-down, recursive split
 Correlation analysis (e.g., Chi-merge: χ2-based discretization)
 Supervised: use class information
 Bottom-up merge: find the best neighboring intervals (those having similar
distributions of classes, i.e., low χ2 values) to merge
 Merge performed recursively, until a predefined stopping condition
70
Concept Hierarchy Generation
 Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is
usually associated with each dimension in a data warehouse
 Concept hierarchies facilitate drilling and rolling in data warehouses to view data in
multiple granularity
 Concept hierarchy formation: Recursively reduce the data by collecting and replacing
low level concepts (such as numeric values for age) by higher level concepts (such as
youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
 Concept hierarchy can be automatically formed for both numeric and nominal data. For
numeric data, use discretization methods shown.

71
Concept Hierarchy Generation
for Nominal Data
 Specification of a partial/total ordering of attributes explicitly at the schema level by
users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by the analysis of the number
of distinct values
 E.g., for a set of attributes: {street, city, state, country}

72
Automatic Concept Hierarchy Generation
 Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the data set
 The attribute with the most distinct values is placed at the lowest

level of the hierarchy


 Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_state 365 distinct values

city 3567 distinct values

street 674,339 distinct values


73
Contents
1. Data Types and Attributes
2. Data Pre-processing
3. OLAP & Multidimensional Data Analysis
4. Various Similarity Measures
Multidimensional Data Model

Data warehouses and OLAP tools are based on a multidimensional
data model.

This model views data in the form of a data cube.

75
Data Cube: A Multidimensional Data Model

Example: AllElectronics wants to keep track of its stores' sales with
respect to time, item, and location.

76
Data Cube: A Multidimensional Data Model

Example: AllElectronics wants to keep track of its stores' sales with
respect to time, item, and location.

77
Data Cube: A Multidimensional Data Model


A data cube allows data to be
modeled and viewed in multiple
dimensions. It is defined by
dimensions and facts.


In general terms, dimensions are the
perspectives or entities with respect
to which an organization wants to
keep records.

78
Data Cube: A Multidimensional Data Model

Each dimension may have a
table associated with it, called a
dimension table, which further
describes the dimension.


For example, a dimension table
for item may contain the
attributes item name, brand,
and type.


Dimension tables can be
specified by users or experts, or
automatically generated and
adjusted based on data
distributions. 79
Data Cube: A Multidimensional Data Model

A multidimensional data model is
typically organized around a central
theme, such as sales.


This theme is represented by a fact
table; the facts are numeric measures.


Examples of facts for a sales data
warehouse include dollars sold
(sales amount in dollars), units sold
(number of units sold), and amount
budgeted.

80
Data Cube: A Multidimensional Data Model

The cuboid that holds the lowest level of summarization is called the base cuboid.
81
Data Cube: A Multidimensional Data Model

The 0-D cuboid, which holds the highest level of summarization, is
called the apex cuboid.


In our example, this is the total sales, or dollars sold, summarized over
all four dimensions.


The apex cuboid is typically denoted by all.

82
Schemas for Multidimensional Database

The entity-relationship data model is commonly used in the design of
relational databases, where a database schema consists of a set of
entities and the relationships between them.


Such a data model is appropriate for online transaction processing.


A data warehouse, however, requires a concise, subject-oriented
schema that facilitates online data analysis.

83
Schemas for Multidimensional Database

The most popular data model for a data warehouse is a
multidimensional model, which can exist in the form of a star
schema, a snowflake schema, or a fact constellation schema.

84
Schemas for Multidimensional Database-Star Schema

The most common modeling paradigm is the star schema, in which the
data warehouse contains
(1) a large central table (fact table) containing the bulk of the data, with
no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each
dimension.


The schema graph resembles a starburst, with the dimension tables
displayed in a radial pattern around the central fact table.

85
Example of Star Schema
Schemas for Multidimensional Database-Snowflake Schema


The snowflake schema is a variant of the star schema model, where
some dimension tables are normalized, thereby further splitting the
data into additional tables.


The resulting schema graph forms a shape similar to a snowflake.

87
Example of Snowflake Schema
Star Schema Vs. Snowflake Schema

The major difference between the snowflake and star schema models
is that the dimension tables of the snowflake model may be kept in
normalized form to reduce redundancies.


Such a table is easy to maintain and saves storage space.


However, this space savings is negligible in comparison to the typical
magnitude of the fact table.

89
Star Schema Vs. Snowflake Schema

Furthermore, the snowflake structure can reduce the effectiveness of
browsing, since more joins will be needed to execute a query.


Consequently, the system performance may be adversely impacted.


Hence, although the snowflake schema reduces redundancy, it is not
as popular as the star schema in data warehouse design.

90
Schemas for Multidimensional Database- Fact Constellation Schema

Sophisticated applications may require multiple fact tables to share
dimension tables.


This kind of schema can be viewed as a collection of stars, and hence
is called a galaxy schema or a fact constellation.

91
Example of Fact Constellation
Dimensions: The Role of Concept Hierarchies

A concept hierarchy defines a sequence of mappings from a set of low-
level concepts to higher-level, more general concepts.

Consider a concept hierarchy for the dimension location

City values for location include Vancouver, Toronto, New York, and
Chicago.
Dimensions: The Role of Concept Hierarchies
Dimensions: The Role of Concept Hierarchies

Hierarchical and lattice structures of attributes in warehouse dimensions


(a) a hierarchy for location and (b) a lattice for time.
OLAP Operations

In the multidimensional model, data
are organized into multiple
dimensions, and each dimension
contains multiple levels of
abstraction defined by concept
hierarchies.

This organization provides users with
the flexibility to view data from
different perspectives.
OLAP Operations

A number of OLAP data cube operations exist to materialize these
different views, allowing interactive querying and analysis of the data
at hand.

Some major OLAP operations are as follows
1) Roll-Up
2) Drill-Down
3) Slice and Dice
4) Pivot
OLAP Operations – Roll-Up

The roll-up operation (also called the drill-up operation by some
vendors) performs aggregation on a data cube, either by climbing
up a concept hierarchy for a dimension or by dimension
reduction.
OLAP Operations – Roll-Up

Figure shows the result of a roll-up
operation performed on the central cube by
climbing up the concept hierarchy for
location.

This hierarchy was defined as the total
order “street < city < province or state <
country.”

The roll-up operation shown aggregates the
data by ascending the location hierarchy
from the level of city to the level of country.

In other words, rather than grouping the
data by city, the resulting cube groups the
data by country.
OLAP Operations – Roll-Up

When roll-up is performed by dimension reduction, one or more
dimensions are removed from the given cube.

For example, consider a sales data cube containing only the location
and time dimensions.

Roll-up may be performed by removing, say, the time dimension,
resulting in an aggregation of the total sales by location, rather than
by location and by time.
OLAP Operations – Drill-Down

Drill-down is the reverse of roll-up. It navigates from less detailed
data to more detailed data.

Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing additional
dimensions.
OLAP Operations – Drill-Down

Figure shows the result of a drill-down
operation performed on the central
cube by stepping down a concept
hierarchy for time defined as
“day < month < quarter < year.”

Drill-down occurs by descending the
time hierarchy from the level of
quarter to the more detailed level of
month.

The resulting data cube details the
total sales per month rather than
summarizing them by quarter.
OLAP Operations – Drill-Down

Because a drill-down adds more detail to the given data, it can also
be performed by adding new dimensions to a cube.

For example, a drill-down on the central cube of the Figure can occur
by introducing an additional dimension, such as customer group.
OLAP Operations – Slice and Dice

The slice operation performs a selection on one dimension of the
given cube, resulting in a subcube.

The dice operation defines a subcube by performing a selection on
two or more dimensions.
OLAP Operations – Slice and Dice

Figure shows a slice operation
where the sales data are
selected from the central cube
for the dimension time using
the criterion time = “Q1.”
OLAP Operations – Slice and Dice

Figure shows a dice operation on the
central cube based on the following
selection criteria that involve three
dimensions:
(location = "Toronto" or "Vancouver")
and
(time = “Q1” or “Q2”)
and
(item = “home entertainment” or “computer”).
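Although OLAP servers execute these operations over cubes, their effect is easy to mimic on a flat table. The sketch below uses a hypothetical pandas sales table to imitate roll-up, slice, and dice:

```python
import pandas as pd

# Hypothetical sales records (one row per city/quarter/item).
sales = pd.DataFrame({
    "country": ["Canada", "Canada", "Canada", "USA", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Toronto", "Chicago", "New York", "Chicago"],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "item":    ["computer", "home entertainment", "computer", "phone", "computer", "phone"],
    "dollars_sold": [605, 825, 680, 400, 512, 430],
})

# Roll-up on location: climb the hierarchy city -> country.
rollup = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Slice: select a single value on one dimension (time = "Q1").
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[
    sales["city"].isin(["Toronto", "Vancouver"])
    & sales["quarter"].isin(["Q1", "Q2"])
    & sales["item"].isin(["home entertainment", "computer"])
]

print(rollup, "\n\n", slice_q1, "\n\n", dice)
```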
OLAP Operations – Pivot

Pivot (also called rotate) is a visualization operation that rotates the data
axes in view to provide an alternative data presentation.
OLAP Operations – Pivot

Figure shows a pivot operation
where the item and location axes in
a 2-D slice are rotated.

Other examples include rotating the
axes in a 3-D cube, or transforming
a 3-D cube into a series of 2-D
planes.
OLAP Operations – Other OLAP operations

Some OLAP systems offer additional drilling operations

For example, drill-across executes queries involving (i.e., across) more
than one fact table.

The drill-through operation uses relational SQL facilities to drill through
the bottom level of a data cube down to its back-end relational tables.
Contents
1. Data Types and Attributes
2. Data Pre-processing
3. OLAP & Multidimensional Data Analysis
4. Various Similarity Measures
Similarity and Dissimilarity

Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
– Often falls in the range [0,1]

Dissimilarity (e.g., distance)
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies

Proximity refers to a similarity or dissimilarity

111
Data Matrix and Dissimilarity Matrix


Data matrix (or object-by-attribute structure):
- This structure stores the n data objects in the form of
a relational table, or n-by-p matrix (n objects × p
attributes).


Dissimilarity matrix(or object-by-object structure):
- This structure stores a collection of proximities that
are available for all pairs of n objects.
- It is often represented by an n-by-n table:

112
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

Dissimilarity Matrix (with Euclidean Distance)
     x1     x2     x3     x4
x1   0
x2   3.61   0
x3   2.24   5.10   0
x4   4.24   1.00   5.39   0

(Figure: the four points plotted in the attribute1–attribute2 plane.)

113
Proximity Measure for Nominal Attributes


Can take 2 or more states, e.g., red, yellow, blue, green (generalization of
a binary attribute)

Method 1: Simple matching
– m: # of matches, p: total # of variables
d(i, j) = (p − m) / p

Method 2: Use a large number of binary attributes
– creating a new binary attribute for each of the M nominal states

114
Proximity Measure for Nominal Attributes
Example : Compute the dissimilarity matrix for given data.


Since here we have one nominal attribute, test-1, we set p = 1

d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ.

115
Proximity Measure for Binary Attributes

A binary attribute has only one of two states: 0 and 1, where 0 means that the attribute is
absent, and 1 means that it is present.

To compute the dissimilarity between two objects described by binary attributes, we count, over
all attributes: q = the number of attributes that equal 1 for both objects i and j, r = the number
that equal 1 for object i but 0 for object j, s = the number that equal 0 for object i but 1 for
object j, and t = the number that equal 0 for both.
116
Proximity Measure for Binary Attributes

Dissimilarity that is based on symmetric binary attributes is called symmetric binary
dissimilarity. If objects i and j are described by symmetric binary attributes, then the
dissimilarity between i and j is
d(i, j) = (r + s) / (q + r + s + t)
117
Proximity Measure for Binary Attributes

For asymmetric binary attributes, the two states are not equally important, such as the
positive (1) and negative (0) outcomes of a disease test.

The dissimilarity based on these attributes is called asymmetric binary dissimilarity,
where the number of negative matches, t, is considered unimportant and is thus ignored
in the following computation:
d(i, j) = (r + s) / (q + r + s)
118
Proximity Measure for Binary Attributes

Complementarily, we can measure the difference between two binary attributes based
on the notion of similarity instead of dissimilarity. For example, the asymmetric binary
similarity between the objects i and j can be computed as
sim(i, j) = q / (q + r + s) = 1 − d(i, j)

The coefficient sim(i, j) is called the Jaccard coefficient.
119
Dissimilarity between Binary Variables

Example

Gender is a symmetric attribute


The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0

120
Distance on Numeric Data: Minkowski Distance
 Minkowski distance: A popular distance measure
d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and h is the order (the distance so defined is also called L-h norm)
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric.

121
Special Cases of Minkowski Distance
● h = 1: Manhattan (city block, L1 norm) distance
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
– E.g., the Hamming distance: the number of bits that are different between two binary vectors

● h = 2: Euclidean (L2 norm) distance
d(i, j) = sqrt(|xi1 − xj1|² + |xi2 − xj2|² + ... + |xip − xjp|²)

● h → ∞: "supremum" (Lmax norm, L∞ norm) distance
d(i, j) = max over f of |xif − xjf|
– This is the maximum difference between any component (attribute) of the vectors

122
Example: Minkowski Distance
Dissimilarity Matrices

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Manhattan (L1)
     x1   x2   x3   x4
x1   0
x2   5    0
x3   3    6    0
x4   6    1    7    0

Euclidean (L2)
     x1     x2     x3     x4
x1   0
x2   3.61   0
x3   2.24   5.10   0
x4   4.24   1.00   5.39   0

Supremum (L∞)
     x1   x2   x3   x4
x1   0
x2   3    0
x3   2    5    0
x4   3    1    5    0
123
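The three matrices above can be reproduced with SciPy's pairwise-distance helper; a short sketch (the metric names are those of scipy.spatial.distance.cdist):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1, 2],    # x1
              [3, 5],    # x2
              [2, 0],    # x3
              [4, 5]])   # x4

manhattan = cdist(X, X, metric="cityblock")   # h = 1 (L1 norm)
euclidean = cdist(X, X, metric="euclidean")   # h = 2 (L2 norm)
supremum  = cdist(X, X, metric="chebyshev")   # h -> infinity (L-max norm)

print(np.round(manhattan, 2))
print(np.round(euclidean, 2))
print(np.round(supremum, 2))
```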
Ordinal Variables

The values of an ordinal attribute have a meaningful order or ranking about them, yet
the magnitude between successive values is unknown.

Let M represent the number of possible states that an ordinal attribute can have.
These ordered states define the ranking 1, . . . , Mf .

Suppose that f is an attribute from a set of ordinal attributes describing n objects. The
dissimilarity computation with respect to f involves the following steps:
1. Replace the value of f for the ith object by its rank, rif ∈ {1, ..., Mf}.
2. Map each rank onto [0.0, 1.0] by computing zif = (rif − 1) / (Mf − 1).
3. Compute the dissimilarity using any distance measure for numeric attributes, with zif as the f value of the ith object.
124
Ordinal Variables

Dissimilarity can then be computed using any of the distance measures for numeric
attributes, using zif to represent the f value for the ith object.

125
Ordinal Variables
Example: Suppose that we have the sample data as shown, except that this time only the
object-identifier and the continuous ordinal attribute, test-2, are available.

There are three states for test-2: fair, good, and excellent,
that is, Mf = 3.

If we replace each value for test-2 by its rank, the four
objects are assigned the ranks 3, 1, 2, and 3, respectively.

We then normalize the ranking by mapping rank 1 to 0.0,
rank 2 to 0.5, and rank 3 to 1.0.

Finally, we can use the Euclidean distance, which results in
the following dissimilarity matrix:

126
Attributes of Mixed Type

A database may contain all attribute types
– Nominal, symmetric binary, asymmetric binary, numeric, ordinal

One may use a weighted formula to combine their effects:
d(i, j) = Σf δij(f) dij(f) / Σf δij(f)
where the indicator δij(f) = 0 if the value of attribute f is missing for either object (or if xif = xjf = 0
and f is asymmetric binary) and 1 otherwise, and dij(f) is the contribution of attribute f to the
dissimilarity between i and j.
127
Attributes of Mixed Type
Example
(The single-attribute dissimilarity matrices for test-1, test-2, and test-3 are shown on the slide.)

We can now use the dissimilarity matrices for the three attributes.

The indicator δij(f) = 1 for each of the three attributes, f.

For example, the overall d(i, j) combines the three single-attribute dissimilarities using the
weighted formula above.

The resulting dissimilarity matrix obtained is

128
Cosine Similarity

A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.


Other vector objects: gene features in micro-arrays, …

Applications: information retrieval, biologic taxonomy, gene feature mapping, ...

Cosine measure:
sim(x, y) = (x · y) / (||x|| × ||y||)

where ||x|| is the Euclidean norm of vector x = (x1, x2, ..., xp), defined as sqrt(x1² + x2² + ... + xp²).
129
Example: Cosine Similarity
Suppose that x and y are the first two
term-frequency vectors as shown.
That is,
X = (5,0,3,0,2,0,0,2,0,0)
Y = (3,0,2,0,1,1,0,1,0,1).
How similar are x and y?
x · y = 25, ||x|| = 6.48, ||y|| = 4.12, so sim(x, y) = 25 / (6.48 × 4.12) ≈ 0.94; the two documents are quite similar.
130
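The computation can be checked in a few lines of NumPy:

```python
import numpy as np

x = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
y = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)

# Cosine similarity = dot product divided by the product of the Euclidean norms.
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(cos_sim, 2))   # 0.94 -> the two documents are quite similar
```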
Simple Matching Coefficient

The simple matching coefficient (SMC) or Rand similarity coefficient is a statistic used for
comparing the similarity and diversity of sample sets.

Given two objects, A and B, each with n binary attributes, SMC is defined as:
SMC = (M00 + M11) / (M00 + M01 + M10 + M11)
where:
M00 is the total number of attributes where A and B both have a value of 0.
M11 is the total number of attributes where A and B both have a value of 1.
M01 is the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
M10 is the total number of attributes where the attribute of A is 1 and the attribute of B is 0.

The simple matching distance (SMD), which measures dissimilarity between sample sets, is given by 1 − SMC
131
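A small sketch computing SMC and, for contrast, the Jaccard coefficient (which ignores the 0-0 matches) for two hypothetical binary vectors:

```python
def smc_and_jaccard(a, b):
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    smc = (m00 + m11) / (m00 + m01 + m10 + m11)   # all matches over all attributes
    jaccard = m11 / (m01 + m10 + m11)             # ignores 0-0 matches (asymmetric view)
    return smc, jaccard

# Hypothetical binary vectors, for illustration only.
a = [1, 0, 0, 1, 1, 0, 1]
b = [1, 1, 0, 1, 0, 0, 1]

print(smc_and_jaccard(a, b))   # SMC = 5/7 ≈ 0.71, Jaccard = 3/5 = 0.6
```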
Review Question
1) Explain data warehouse architecture with its analytical processing.
2) Suppose that a data warehouse consists of the four dimensions date,
spectator, location, and game, and the two measures count and charge, where
charge is the fare that a spectator pays when watching a game on a given
date. Spectators may be students, adults or seniors, with each category having
its own charge rate.
a) Draw a star schema diagram for the data warehouse.
b) Starting with the base cuboid [date, spectator, location, game], what specific
OLAP operations should you perform in order to list the total charge paid by
student spectators at Dashrath Stadium in 2021?
Review Question

3) Describe snowflake schema with an example.


4) Describe OLAP and operations on OLAP with suitable examples.
5) Suppose that a data warehouse for a sales company consists of five
dimensions: time, location, supplier, brand, and product, and two measures:
count and price.
a) Draw a snowflake schema diagram for the data warehouse.
b) Starting with the base cuboid [time, location, supplier, brand, product], what
specific OLAP operations should one perform in order to list the total count for
a certain brand for each state per year, (assume location has three levels:
country, state, city; and assume time has three levels: year, month, day)?
Review Question

6) Use the following methods to normalize the data: 200, 300, 400, 600
and 1000.
a) Min-max normalization by setting min=0 and max=1.
b) Z-score normalization.
c) Normalization by decimal scaling.
Review Question

7) Find the principal components and the proportion of the total variance
explained by each when the covariance matrix of the three random
variables X1, X2 and X3 is :
Review Question

8) Given the following points, compute the distance matrix using the
Manhattan and the supremum distances.
Review Question

9) Given the following two vectors compute the Cosine similarity between
them.
D1 = [ 4 0 2 0 1]
D2 = [ 2 0 0 2 2]
10) Given the following two binary vectors compute the Jaccard similarity
and Simple Matching Coefficient.
P = [ 0 0 1 1 0 1]
Q = [1 1 1 1 0 1]
Review Question

11) Why is data preprocessing necessary? Explain the methods for data
preprocessing to maintain data quality.
12) What is data pre-processing? Explain data sampling and
dimensionality reduction in data pre-processing with suitable example.
13) What are the approaches to handle missing data?
14) What are the measuring elements of data quality? Explain different
data transformation by normalization methods with an example.
