Week 2:
Data Quality
Data is Dirty
• Real-world data are dirty: they often contain potentially incorrect values (faulty instruments, human/computer error, transmission error) and are incomplete, noisy, and inconsistent.
• Incomplete
– Lacking attribute values or attributes of interest, or containing only aggregate data
– e.g., Occupation = “” (missing data)
• Noisy
– Containing noise, errors, or outliers
– e.g., Salary = -10 (an error)
• Inconsistent
– Containing discrepancies in codes or names
– e.g., Age = “42” but Birthday = “03/07/2010”
– Rating was recorded as “1, 2, 3”, now as “A, B, C”
– Discrepancies between duplicate records
• Intentional (e.g., disguised missing data)
– Jan 1 as everyone’s birthday?
Data Cleaning (Data Cleansing)
• Attempt to fill in missing values
• Smooth out noise while identifying outliers
• Correct inconsistencies in the data
Data Cleaning
Incomplete (missing) values
• Data is not always available:
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Missing data may be due to:
– Equipment malfunction
– Inconsistent with other recorded data and thus deleted
– Data were not entered due to misunderstanding
– Certain data may not be considered important at the time of entry
– Did not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple:
– Usually done when the class label is missing
– Not effective when the percentage of missing values per attribute varies considerably
– Does not make use of the remaining attribute values in the tuple
• Fill in the missing value manually: tedious + infeasible for a large dataset with
many missing values.
• Fill it in automatically with one of the following (a short sketch follows this list):
– A global constant, e.g., “unknown” – not foolproof
– A measure of central tendency (mean or median)
– The attribute mean or median for all samples belonging to the same class as the given
tuple: smarter
– The most probable value: may be determined with regression, Bayesian inference, or decision tree induction.
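A minimal sketch of these fill-in strategies using pandas; the DataFrame, its column names, and the “unknown” constant are illustrative assumptions, not data from the slides:

```python
import pandas as pd

# illustrative data with missing values
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],
    "income": [52_000, None, 47_000, 61_000],
    "class": ["A", "A", "B", None],
})

# Ignore the tuple: drop rows whose class label is missing
df = df.dropna(subset=["class"])

# Fill with a global constant
df["occupation"] = df["occupation"].fillna("unknown")

# Fill with a measure of central tendency (mean of the attribute)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the mean of all samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
```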
Noisy Data
• What is noise? Random error or variance in a measured variable; a modification of the original values.
• Examples:
– Distortion of a person’s voice when talking on a poor phone line, or “snow” on a television screen.
How to Handle Noisy Data?
• Binning: smooths a sorted data value by consulting the values around it (see the sketch after this list).
– First sort the data and partition it into (equal-frequency) bins
– Then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
– Smooth by fitting the data into regression functions
• Outlier analysis: clustering
– Detect and remove outliers
• Semi-supervised: combined computer and human inspection
– Detect suspicious values automatically and have a human check them
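A minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the sample values and the NumPy-based helper code are illustrative assumptions:

```python
import numpy as np

# illustrative sorted values (not taken from the slides)
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# partition into equal-frequency bins (3 values per bin here)
bins = np.array_split(data, 3)

# smoothing by bin means: every value in a bin becomes the bin mean
by_means = [np.full(len(b), b.mean()) for b in bins]

# smoothing by bin boundaries: every value becomes the nearest bin boundary
by_boundaries = [np.where(b - b.min() < b.max() - b, b.min(), b.max())
                 for b in bins]

print(by_means)        # bin means 9, 22, 29 repeated within each bin
print(by_boundaries)   # e.g., first bin [4, 8, 15] -> [4, 4, 15]
```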
Data Integration
• Combining data from multiple sources into a coherent data store.
• Why data integration?
– Help reduce/avoid noise
– Get more complete picture
– Improve mining speed and quality
• Challenges:
– Semantic heterogeneity
– Differences in the structure of the data
Data Integration Problems
• Entity identification problem: matching schemas and objects from different sources.
– e.g., deciding whether A.cust_id and B.cust_no refer to the same attribute
– Integrate metadata from different sources
• Redundancy and correlation analysis: spotting correlated numerical and nominal data
• Tuple duplication
• Data value conflict detection and resolution
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases.
– Object identification: the same attribute or object may have different names in different databases
– Derivable data: one attribute may be derivable from another attribute or table, e.g., annual revenue
• What’s the problem?
– If X1 and X2 are redundant copies of the same attribute X, the single relationship Y = 2X can be written equally well as Y = X1 + X2, Y = 3X1 − X2, or Y = −1291X1 + 1293X2: the fitted coefficients become ambiguous and unstable.
• Redundant attributes may be detected by correlation analysis and
covariance analysis.
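A minimal sketch of correlation and covariance analysis for spotting redundant attributes; the attribute names and values are illustrative assumptions (the chi-square test shown for the nominal attributes is the usual companion technique, not something spelled out on this slide):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# numeric attributes: correlation / covariance analysis
df = pd.DataFrame({
    "price": [100.0, 150.0, 200.0, 250.0],
    "tax":   [10.0, 15.0, 20.0, 25.0],   # derivable from price
    "age":   [23, 41, 35, 52],
})
print(df.corr())   # price and tax correlate perfectly -> redundant
print(df.cov())    # covariance matrix of the numeric attributes

# nominal attributes: chi-square test of independence on a contingency table
contingency = pd.crosstab(
    pd.Series(["M", "M", "F", "F", "F", "M"], name="gender"),
    pd.Series(["yes", "no", "yes", "no", "yes", "no"], name="bought"),
)
chi2, p, dof, expected = chi2_contingency(contingency)
print(chi2, p)     # a small p-value suggests the attributes are correlated
```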
Data Transformation
• A function that maps the entire set of values of a given attribute to a new
set of replacement values.
• Methods:
– Smoothing: remove noise from data
– Attribute/feature construction: new attributes constructed from the given ones.
– Aggregation: summarization, data cube construction
– Normalization: scaled to fall within a smaller, specified range
• Min-max normalization
• Z-score normalization
• Normalization by decimal scaling
– Discretization: concept hierarchy climbing
Normalization
• Attribute data are scaled so as to fall within a smaller range, such as
-1 to +1, or 0 to 1.
• Normalizing the data attempts to give all attributes an equal weight.
• Example case:
– Classification using a neural network (backpropagation): normalizing input values helps speed up the learning phase.
– Distance-based methods: prevents attributes with large ranges from outweighing attributes with smaller ranges.
– Useful when there is no prior knowledge of the data.
Min-max normalization
• Maps the original range [minA, maxA] to a new range [new_minA, new_maxA]:
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
• Example: Let income range $12,000 to $98,000 be normalized to [0, 1]
– Then $73,600 is mapped to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
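A minimal sketch of min-max normalization reproducing the income example above (function and parameter names are illustrative):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# income range $12,000..$98,000 normalized to [0, 1]
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```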
Z-score normalization
• Zero-mean normalization: the value is expressed as the distance between the raw value and the mean, in units of the standard deviation:
v' = (v − meanA) / std_devA
• Example: with a mean income of $54,000 and a standard deviation of $16,000, $73,600 is mapped to
(73,600 − 54,000) / 16,000 = 1.225
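The same example as a short sketch (illustrative function name):

```python
def z_score_normalize(v, mean_a, std_a):
    """Express v as its distance from the mean in standard-deviation units."""
    return (v - mean_a) / std_a

# mean income $54,000, standard deviation $16,000
print(z_score_normalize(73_600, 54_000, 16_000))  # 1.225
```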
Normalization by decimal scaling
• Normalizes by moving the decimal point of the values of an attribute:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
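A minimal sketch of decimal-scaling normalization (the values and function name are illustrative):

```python
def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([-986, 120, 917]))  # j = 3 -> [-0.986, 0.12, 0.917]
```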
Dimensionality Reduction
• The process of reducing the number of attributes or features under
consideration.
• Benefit:
– It reduces the number of attributes appearing in the discovered patterns, helping to
make the patterns easier to understand.
• Redundant attributes:
– Duplicate much or all of the information contained in one or more other attributes.
– Example: purchase price of a product and the amount of sales tax paid
• Irrelevant attributes:
– Contain no information that is useful for the data mining task.
– Example: a student’s ID is often irrelevant to the task of predicting his/her GPA
Attribute Subset Selection (Greedy)
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important
information in a dataset more effectively than the original ones.
• Three general methodologies:
– Attribute extraction: domain-specific
– Mapping data to new space
• e.g., Fourier transformation, wavelet transformation, manifold approaches
– Attribute construction
• Combining features
• Data discretization
Principal Component Analysis (PCA)
• PCA: a statistical procedure that uses an
orthogonal transformation to convert a set
of observations of possibly correlated
variables into a set of values of linearly
uncorrelated variables called principal
components.
• The original data are projected onto a
much smaller space, resulting in
dimensionality reduction.
• Method:
– Find the eigenvectors of the covariance
matrix,
– And these eigenvectors define the new
space.
[Figure: a ball travels in a straight line; the data recorded by three cameras contain much redundancy.]
PCA Method
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that best represent the data.
– Normalize input data: each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input vector is expressed as a linear combination of the k principal component vectors.
– The principal components are sorted in order of decreasing significance or strength.
– Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using only the strongest principal components, a good approximation of the original data can be reconstructed).
• Works for numeric data only.
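A minimal sketch of the PCA steps above using the eigenvectors of the covariance matrix; the toy data are illustrative assumptions, and a library routine such as scikit-learn's PCA would give an equivalent result:

```python
import numpy as np

# toy data: 5 observations of 3 correlated numeric attributes (illustrative)
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 1.0],
              [3.1, 3.0, 1.6]])

# 1. normalize the input data (here: center each attribute)
Xc = X - X.mean(axis=0)

# 2. the eigenvectors of the covariance matrix define the new space
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. sort the components by decreasing variance (significance)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. keep only the k strongest components and project the data
k = 1
Z = Xc @ eigvecs[:, :k]                             # reduced data (5 x k)
X_approx = Z @ eigvecs[:, :k].T + X.mean(axis=0)    # approximate reconstruction
print(eigvals)                                      # variance along each component
```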
Exercises
Homework
• Do all the exercises.
• You can write the solution on paper, or you can use tools such as Excel or Python; explain your work in detail, step by step, up to the final solution.
• Create a PDF file of your solution and submit it to ULS.
• You may also upload the file you used for the computation (.xlsx or .ipynb) along with your .pdf file. Upload those files separately.
• Note: do not forget to put your Student ID and name on the first page of the PDF file.
Exercise 1: Normalization
The recorded age data D = {13, 15, 16, 16, 19, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70} is used for analysis.