HIT391
MACHINE LEARNING:
ADVANCEMENTS
AND APPLICATIONS
Lecturer: Dr. Yan Zhang
Email: [email protected]
Office: Purple 12.3.4
Week 3:
Data Analysis and Data Pre-processing
Learning Outcomes
- Data analysis
- Data preprocessing
An Example of Supervised Learning
[Figure: the two-step supervised learning workflow.]
Step 1: Training
– The ML algorithm learns from the training data and produces a classifier (learned model).
Step 2: Testing
– A new tuple, e.g., (<40, high, yes, fair), is fed to the learned classifier (model), which outputs a prediction, e.g., "Buys computer?".
Outline
An Overview
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data
Discretization
Summary
An Overview
Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
– Accuracy: correct or wrong, accurate or not, e.g., synthetic data
– Completeness: not recorded, unavailable, … e.g., missing data
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update? e.g., collected 10 years ago
– Believability: how much can the data be trusted to be correct?
– Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
– Integration of multiple databases, data cubes, or files
Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
Data transformation and data discretization
– Normalization, e.g., all data need to be normalized to a range such as [0, 1]
– Concept hierarchy generation
Data Cleaning
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
– Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– history or changes of the data were not registered
Missing data may need to be inferred, e.g., in a recommender
system
How to Handle Missing Data?
Ignore the tuple: usually done when class label is
missing (when doing classification)—not effective
when the % of missing values per attribute varies
considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
– a global constant: e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the
same class: smarter
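A minimal sketch of the automatic fill-in options above, using pandas; the column names and the tiny data frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with missing incomes (NaN)
df = pd.DataFrame({
    "buys_computer": ["yes", "yes", "no", "no", "yes"],
    "income": [50000.0, np.nan, 32000.0, np.nan, 61000.0],
})

# Option 1: a global constant (creates a special "unknown" value)
df["income_const"] = df["income"].fillna(-1)

# Option 2: the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Option 3 (smarter): the attribute mean within the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("buys_computer")["income"].transform("mean")
)
print(df)
```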
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
Regression
– smooth by fitting the data into regression functions
Clustering
– detect and remove outliers
Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with possible
outliers)
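A small sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the sorted values are made up for illustration.

```python
import numpy as np

# Hypothetical sorted values of a noisy attribute: 12 values, 3 equal-depth bins
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
bins = prices.reshape(3, 4)

# Smooth by bin means: every value becomes the mean of its bin
by_means = np.repeat(bins.mean(axis=1), 4)

# Smooth by bin boundaries: every value becomes the nearest bin boundary
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)    # bin means 9.0, 22.75, 29.25 repeated within each bin
print(by_bounds)
```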
Data Integration
Data integration
Data integration:
– Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric
vs. British units
Handling Redundancy
Redundant data occur often when integration of
multiple databases
– Object identification: The same attribute or object
may have different names in different databases
– Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis
and covariance analysis
Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson's product-moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(a_i b_i)$ is the sum of the AB cross-product.
If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
$r_{A,B} = 0$: independent; $r_{A,B} < 0$: negatively correlated
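As a sketch, the coefficient can be computed directly with NumPy, here reusing the two small stock series from the covariance example later in this lecture.

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

n = len(A)
# Pearson correlation using population standard deviations (n in the denominator)
r = ((A - A.mean()) * (B - B.mean())).sum() / (n * A.std() * B.std())
print(r)                        # ~0.94: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # same value from NumPy's built-in
```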
Covariance (Numeric Data)
Covariance is similar to correlation:

$$Cov(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$

Correlation coefficient:

$$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
Positive covariance: if $Cov_{A,B} > 0$, then A and B both tend to be larger than their expected values.
Negative covariance: if $Cov_{A,B} < 0$, then when A is larger than its expected value, B is likely to be smaller than its expected value.
Independence: if A and B are independent, then $Cov_{A,B} = 0$, but the converse is not true.
Co-Variance: An Example
Covariance can be simplified in computation as

$$Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B}$$
Suppose two stocks A and B have the following values in one
week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
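A quick check of the arithmetic using the simplified form Cov(A,B) = E(A·B) − Ā·B̄:

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # stock A prices
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B prices

cov = (A * B).mean() - A.mean() * B.mean()
print(cov)                             # 4.0, matching the worked example
print(np.cov(A, B, bias=True)[0, 1])   # population covariance via NumPy
```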
Data Reduction
Data Reduction Methods
Data reduction: Obtain a reduced representation of the data set
that is much smaller in volume but yet produces the same (or
almost the same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long
time to run on the complete data set.
Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
Principal Components Analysis (PCA)
Feature subset selection, feature creation
– Numerosity reduction
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
– Data compression
Dimensionality Reduction
Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
– The possible combinations of subspaces will grow exponentially
Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
Dimensionality reduction techniques
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in
data
The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the eigenvectors of
the covariance matrix, and these eigenvectors define the new
space
[Figure: 2-D data points in the (x1, x2) plane, with the principal component directions shown.]
Steps of PCA
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– Since the components are sorted, the size of the data can be reduced
by eliminating the weak components, i.e., those with low variance (i.e.,
using the strongest principal components, it is possible to reconstruct
a good approximation of the original data)
Works for numeric data only
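A compact sketch of these PCA steps with NumPy (normalize, eigendecompose the covariance matrix, keep the k strongest components); the random data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # N = 100 data vectors in n = 5 dimensions
k = 2                           # number of principal components to keep

# 1. Normalize input data so each attribute falls within the same range
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigenvectors of the covariance matrix define the new space
eigvals, eigvecs = np.linalg.eigh(np.cov(Xn, rowvar=False))

# 3. Sort components by decreasing "significance" (variance explained)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:k]]

# 4. Project the data onto the k strongest components
X_reduced = Xn @ components
print(X_reduced.shape)          # (100, 2)
```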
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
– Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
– Contain no information that is useful for the data
mining task at hand
– E.g., students' ID is often irrelevant to the task of
predicting students' GPA
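One simple filter-style sketch for spotting a redundant attribute via the correlation analysis discussed earlier; the attributes and threshold are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price": price,
    "sales_tax": price * 0.10,      # redundant: derivable from price
    "student_id": np.arange(200),   # irrelevant to most prediction tasks
})

# Flag attribute pairs whose absolute correlation exceeds a threshold
corr = df.corr().abs()
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.95]
print(redundant)                    # [('price', 'sales_tax')]
```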
Data Reduction
Reduce data volume by choosing alternative, smaller forms of data
representation
Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model parameters, store only
the parameters, and discard the data (except possible outliers)
– Ex.: Log-linear models—obtain the value at a point in m-D space as the
product of values on appropriate marginal subspaces
Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
Regression and Log-Linear Models
Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
Multiple regression
– Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
Log-linear model
– Approximates discrete multidimensional probability
distributions
Regression Analysis
[Figure: data points (x, y) scattered around the fitted line y = x + 1.]
Regression analysis: a collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also
called response variable or measurement) and of one or more independent
variables (aka explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method,
but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
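A minimal least-squares sketch of the parametric idea: fit a line close to y = x + 1, store only the two parameters, and discard the raw points (the synthetic data are placeholders).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=1000)
y = x + 1 + rng.normal(scale=0.5, size=1000)   # roughly y = x + 1 plus noise

# Least-squares fit of y = w1*x + w0; only (w1, w0) need to be stored
w1, w0 = np.polyfit(x, y, deg=1)
print(w1, w0)                                  # close to 1.0 and 1.0

# The stored parameters reproduce approximate values on demand
y_hat = w1 * x + w0
```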
Clustering
Partition data set into clusters based
on similarity, and store cluster
representation (e.g., centroid and
diameter) only
Can be very effective if data is
clustered but not if data is “smeared”
Can have hierarchical clustering and
be stored in multi-dimensional index
tree structures
There are many choices of clustering
definitions and clustering algorithms
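A sketch of cluster-based numerosity reduction with scikit-learn's KMeans (assuming scikit-learn is available): store only each cluster's centroid, size, and a diameter-like spread instead of all the points.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2)) for c in (0, 3, 6)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Compact cluster representation replacing the 600 raw points
summary = []
for k in range(3):
    members = X[km.labels_ == k]
    centroid = members.mean(axis=0)
    diameter = 2 * np.linalg.norm(members - centroid, axis=1).max()
    summary.append({"centroid": centroid, "size": len(members), "diameter": diameter})
print(summary)
```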
Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Key principle: Choose a representative subset of the
data
– Simple random sampling may have very poor
performance in the presence of skew
– Develop adaptive sampling methods, e.g., stratified
sampling:
Note: Sampling may not reduce database I/Os (page
at a time)
Types of Sampling
Simple random sampling
– There is an equal probability of selecting any
particular item
Sampling without replacement
– Once an object is selected, it is removed from the
population
Sampling with replacement
– A selected object is not removed from the population
Stratified sampling:
– Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
– Used in conjunction with skewed data
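The sampling variants above map directly onto pandas operations; the class column and sizes below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "cls": rng.choice(["A", "B", "C"], size=1000, p=[0.7, 0.2, 0.1]),  # skewed classes
})

# Simple random sampling without / with replacement
srswor = df.sample(n=100, replace=False, random_state=0)
srswr = df.sample(n=100, replace=True, random_state=0)

# Stratified sampling: draw ~10% from each class partition
stratified = df.groupby("cls").sample(frac=0.10, random_state=0)
print(stratified["cls"].value_counts())   # proportions mirror the skewed classes
```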
Stratified Sampling
[Figure: raw data on the left, the corresponding cluster/stratified sample on the right.]
Data Compression
String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is possible without
expansion
Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
Time sequences are not audio
– They are typically short and vary slowly with time
Dimensionality and numerosity reduction may also be considered as
forms of data compression
Data Compression
[Figure: lossless compression maps the original data to compressed data from which the original can be fully recovered; lossy compression recovers only an approximation of the original data.]
Data Transformation
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement
values s.t. each old value can be identified with one of the new values
Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
New attributes constructed from the given ones
– Aggregation: Summarization
– Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
– Discretization: one-hot encoding, Concept hierarchy climbing
Normalization
Min-max normalization: to [new_min_A, new_max_A]

$$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$$

– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 ≈ 0.709
Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

– Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 − 54,000)/16,000 ≈ 1.19
Normalization by decimal scaling:

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that max(|v'|) < 1
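The three normalization formulas translate directly into code; the income figures follow the slide's example.

```python
import numpy as np

income = np.array([12_000.0, 73_000.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
mn, mx = 12_000, 98_000
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0
print(minmax[1])                  # 73,000 -> ~0.709

# Z-score normalization with mu = 54,000 and sigma = 16,000
zscore = (income - 54_000) / 16_000
print(zscore[1])                  # 73,000 -> ~1.19

# Decimal scaling: divide by 10**j, with j the smallest integer so max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
print(income / 10**j)             # all values now within (-1, 1)
```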
Discretization
Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic rank
– Numeric—e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: All the methods can be applied
recursively
– Binning
Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or
bottom-up merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation analysis (unsupervised, bottom-up
merge)
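A brief sketch of unsupervised binning with pandas: equal-width (cut) versus equal-frequency (qcut) intervals on a made-up age attribute.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
age = pd.Series(rng.integers(16, 80, size=500), name="age")

# Equal-width binning: four intervals of equal width over the age range
equal_width = pd.cut(age, bins=4)

# Equal-frequency binning: four intervals holding roughly the same number of values
equal_freq = pd.qcut(age, q=4)

# Interval labels can replace the raw values
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```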
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a
data warehouse
Concept hierarchies facilitate drilling and rolling in data
warehouses to view data in multiple granularity
Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as youth, adult, or
senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric
and nominal data. For numeric data, use discretization methods
shown.
Summary
Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
– Entity identification problem
– Remove redundancies
– Detect inconsistencies
Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
Data transformation and data discretization
– Normalization
– Concept hierarchy generation
References
Han, Jiawei, Jian Pei, and Hanghang Tong. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2022.
Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, Chapter 3 slides. University of Illinois at Urbana-Champaign & Simon Fraser University, ©2011 Han, Kamber & Pei.