HIT391
MACHINE LEARNING:
ADVANCEMENTS
AND APPLICATIONS
Lecturer: Dr. Yan Zhang
Email: [email protected]
Office: Purple 12.3.4
Week 3:
Data Analysis and Data Pre-processing
Learning Outcomes
- Data analysis
- Data preprocessing
An Example of Supervised Learning
[Figure: the two-step supervised learning workflow.]
Step 1: Training
– The ML algorithm learns from the training data and produces a classifier (learned model).
Step 2: Testing
– A new tuple, e.g., (<40, high, yes, fair), is fed to the learned classifier (model), which outputs a prediction, e.g., "Buys computer?".
Outline
An Overview
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data
Discretization
Summary
An Overview
Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
– Accuracy: correct or wrong, accurate or not, e.g., synthetic data
– Completeness: not recorded, unavailable, … e.g., missing data
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update? e.g., collected 10 years ago
– Believability: how much can the data be trusted to be correct?
– Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
– Integration of multiple databases, data cubes, or files
Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
Data transformation and data discretization
– Normalization, e.g., all data need to be normalized to a range such as [0, 1]
– Concept hierarchy generation
Data Cleaning
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
– Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– history or changes of the data were not registered
Missing data may need to be inferred, e.g., in a recommender
system
How to Handle Missing Data?
Ignore the tuple: usually done when class label is
missing (when doing classification)—not effective
when the % of missing values per attribute varies
considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
– a global constant: e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the
same class: smarter
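A minimal sketch of the automatic fill-in options above, using pandas; the column names and the tiny data frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with missing incomes (NaN)
df = pd.DataFrame({
    "buys_computer": ["yes", "yes", "no", "no", "yes"],
    "income": [50000.0, np.nan, 32000.0, np.nan, 61000.0],
})

# Option 1: a global constant (creates a special "unknown" value)
df["income_const"] = df["income"].fillna(-1)

# Option 2: the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Option 3 (smarter): the attribute mean within the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("buys_computer")["income"].transform("mean")
)
print(df)
```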
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
Regression
– smooth by fitting the data into regression functions
Clustering
– detect and remove outliers
Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with possible
outliers)
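A small sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the sorted values are made up for illustration.

```python
import numpy as np

# Hypothetical sorted values of a noisy attribute: 12 values, 3 equal-depth bins
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
bins = prices.reshape(3, 4)

# Smooth by bin means: every value becomes the mean of its bin
by_means = np.repeat(bins.mean(axis=1), 4)

# Smooth by bin boundaries: every value becomes the nearest bin boundary
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)    # bin means 9.0, 22.75, 29.25 repeated within each bin
print(by_bounds)
```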
Data Integration
Data integration
Data integration:
– Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric
vs. British units
Handling Redundancy
Redundant data occur often when integration of
multiple databases
– Object identification: The same attribute or object
may have different names in different databases
– Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis
and covariance analysis
Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson's product-moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(a_i b_i)$ is the sum of the AB cross-product.
If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
$r_{A,B} = 0$: independent; $r_{A,B} < 0$: negatively correlated
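As a sketch, the coefficient can be computed directly with NumPy, here reusing the two small stock series from the covariance example later in this lecture.

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

n = len(A)
# Pearson correlation using population standard deviations (n in the denominator)
r = ((A - A.mean()) * (B - B.mean())).sum() / (n * A.std() * B.std())
print(r)                        # ~0.94: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # same value from NumPy's built-in
```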
Covariance (Numeric Data)
Covariance is similar to correlation:

$$Cov(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$

Correlation coefficient:

$$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
Positive covariance: if $Cov_{A,B} > 0$, then A and B both tend to be larger than their expected values.
Negative covariance: if $Cov_{A,B} < 0$, then when A is larger than its expected value, B is likely to be smaller than its expected value.
Independence: if A and B are independent, then $Cov_{A,B} = 0$, but the converse is not true.
Co-Variance: An Example
Covariance can be simplified in computation as

$$Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B}$$
Suppose two stocks A and B have the following values in one
week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
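A quick check of the arithmetic using the simplified form Cov(A,B) = E(A·B) − Ā·B̄:

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # stock A prices
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B prices

cov = (A * B).mean() - A.mean() * B.mean()
print(cov)                             # 4.0, matching the worked example
print(np.cov(A, B, bias=True)[0, 1])   # population covariance via NumPy
```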
Data Reduction
Data Reduction Methods
Data reduction: Obtain a reduced representation of the data set
that is much smaller in volume but yet produces the same (or
almost the same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long
time to run on the complete data set.
Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
Principal Components Analysis (PCA)
Feature subset selection, feature creation
– Numerosity reduction
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
– Data compression
Dimensionality Reduction
Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
– The possible combinations of subspaces will grow exponentially
Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
Dimensionality reduction techniques
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in
data
The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the eigenvectors of
the covariance matrix, and these eigenvectors define the new
space
[Figure: 2-D data points in the (x1, x2) plane, with the principal component directions shown.]
Steps of PCA
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– Since the components are sorted, the size of the data can be reduced
by eliminating the weak components, i.e., those with low variance (i.e.,
using the strongest principal components, it is possible to reconstruct
a good approximation of the original data)
Works for numeric data only
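A compact sketch of these PCA steps with NumPy (normalize, eigendecompose the covariance matrix, keep the k strongest components); the random data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # N = 100 data vectors in n = 5 dimensions
k = 2                           # number of principal components to keep

# 1. Normalize input data so each attribute falls within the same range
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigenvectors of the covariance matrix define the new space
eigvals, eigvecs = np.linalg.eigh(np.cov(Xn, rowvar=False))

# 3. Sort components by decreasing "significance" (variance explained)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:k]]

# 4. Project the data onto the k strongest components
X_reduced = Xn @ components
print(X_reduced.shape)          # (100, 2)
```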
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
– Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
– Contain no information that is useful for the data
mining task at hand
– E.g., students' ID is often irrelevant to the task of
predicting students' GPA
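One simple filter-style sketch for spotting a redundant attribute via the correlation analysis discussed earlier; the attributes and threshold are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price": price,
    "sales_tax": price * 0.10,      # redundant: derivable from price
    "student_id": np.arange(200),   # irrelevant to most prediction tasks
})

# Flag attribute pairs whose absolute correlation exceeds a threshold
corr = df.corr().abs()
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.95]
print(redundant)                    # [('price', 'sales_tax')]
```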
Data Reduction
Reduce data volume by choosing alternative, smaller forms of data
representation
Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model parameters, store only
the parameters, and discard the data (except possible outliers)
– Ex.: Log-linear models—obtain the value at a point in m-D space as the
product of values on appropriate marginal subspaces
Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
Regression and Log-Linear Models
Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
Multiple regression
– Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
Log-linear model
– Approximates discrete multidimensional probability
distributions
Regression Analysis
[Figure: data points (x, y) scattered around the fitted line y = x + 1.]
Regression analysis: a collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also
called response variable or measurement) and of one or more independent
variables (aka explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method,
but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
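A minimal least-squares sketch of the parametric idea: fit a line close to y = x + 1, store only the two parameters, and discard the raw points (the synthetic data are placeholders).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=1000)
y = x + 1 + rng.normal(scale=0.5, size=1000)   # roughly y = x + 1 plus noise

# Least-squares fit of y = w1*x + w0; only (w1, w0) need to be stored
w1, w0 = np.polyfit(x, y, deg=1)
print(w1, w0)                                  # close to 1.0 and 1.0

# The stored parameters reproduce approximate values on demand
y_hat = w1 * x + w0
```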
Clustering
Partition data set into clusters based
on similarity, and store cluster
representation (e.g., centroid and
diameter) only
Can be very effective if data is
clustered but not if data is “smeared”
Can have hierarchical clustering and
be stored in multi-dimensional index
tree structures
There are many choices of clustering
definitions and clustering algorithms
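A sketch of cluster-based numerosity reduction with scikit-learn's KMeans (assuming scikit-learn is available): store only each cluster's centroid, size, and a diameter-like spread instead of all the points.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2)) for c in (0, 3, 6)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Compact cluster representation replacing the 600 raw points
summary = []
for k in range(3):
    members = X[km.labels_ == k]
    centroid = members.mean(axis=0)
    diameter = 2 * np.linalg.norm(members - centroid, axis=1).max()
    summary.append({"centroid": centroid, "size": len(members), "diameter": diameter})
print(summary)
```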
Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Key principle: Choose a representative subset of the
data
– Simple random sampling may have very poor
performance in the presence of skew
– Develop adaptive sampling methods, e.g., stratified
sampling:
Note: Sampling may not reduce database I/Os (page
at a time)
Types of Sampling
Simple random sampling
– There is an equal probability of selecting any
particular item
Sampling without replacement
– Once an object is selected, it is removed from the
population
Sampling with replacement
– A selected object is not removed from the population
Stratified sampling:
– Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
– Used in conjunction with skewed data
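The sampling variants above map directly onto pandas operations; the class column and sizes below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "cls": rng.choice(["A", "B", "C"], size=1000, p=[0.7, 0.2, 0.1]),  # skewed classes
})

# Simple random sampling without / with replacement
srswor = df.sample(n=100, replace=False, random_state=0)
srswr = df.sample(n=100, replace=True, random_state=0)

# Stratified sampling: draw ~10% from each class partition
stratified = df.groupby("cls").sample(frac=0.10, random_state=0)
print(stratified["cls"].value_counts())   # proportions mirror the skewed classes
```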
Stratified Sampling
[Figure: raw data on the left, the corresponding cluster/stratified sample on the right.]
Data Compression
String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is possible without
expansion
Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
Time sequences are not audio
– They are typically short and vary slowly with time
Dimensionality and numerosity reduction may also be considered as
forms of data compression
Data Compression
[Figure: lossless compression maps the original data to compressed data from which the original can be fully recovered; lossy compression recovers only an approximation of the original data.]
Data Transformation
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement
values s.t. each old value can be identified with one of the new values
Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
New attributes constructed from the given ones
– Aggregation: Summarization
– Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
– Discretization: one-hot encoding, Concept hierarchy climbing
Normalization
Min-max normalization: to [new_min_A, new_max_A]

$$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$$

– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 ≈ 0.709
Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

– Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 − 54,000)/16,000 ≈ 1.19
Normalization by decimal scaling:

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that max(|v'|) < 1
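The three normalization formulas translate directly into code; the income figures follow the slide's example.

```python
import numpy as np

income = np.array([12_000.0, 73_000.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
mn, mx = 12_000, 98_000
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0
print(minmax[1])                  # 73,000 -> ~0.709

# Z-score normalization with mu = 54,000 and sigma = 16,000
zscore = (income - 54_000) / 16_000
print(zscore[1])                  # 73,000 -> ~1.19

# Decimal scaling: divide by 10**j, with j the smallest integer so max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
print(income / 10**j)             # all values now within (-1, 1)
```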
Discretization
Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic rank
– Numeric—e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: All the methods can be applied
recursively
– Binning
Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or
bottom-up merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation analysis (unsupervised, bottom-up
merge)
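A brief sketch of unsupervised binning with pandas: equal-width (cut) versus equal-frequency (qcut) intervals on a made-up age attribute.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
age = pd.Series(rng.integers(16, 80, size=500), name="age")

# Equal-width binning: four intervals of equal width over the age range
equal_width = pd.cut(age, bins=4)

# Equal-frequency binning: four intervals holding roughly the same number of values
equal_freq = pd.qcut(age, q=4)

# Interval labels can replace the raw values
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```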
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a
data warehouse
Concept hierarchies facilitate drilling and rolling in data
warehouses to view data in multiple granularity
Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as youth, adult, or
senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric
and nominal data. For numeric data, use discretization methods
shown.
Summary
Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
– Entity identification problem
– Remove redundancies
– Detect inconsistencies
Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
Data transformation and data discretization
– Normalization
– Concept hierarchy generation
References
Han, Jiawei, Jian Pei, and Hanghang Tong. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2022.
Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, Chapter 3 slides. University of Illinois at Urbana-Champaign & Simon Fraser University, ©2011 Han, Kamber & Pei.