
1604C331 Data Mining

Week 2:
Data Quality

Odd Semester 2024-2025


Informatics Engineering
Faculty of Engineering | Universitas Surabaya
Data Quality
• Poor data quality negatively affects many data processing efforts.

• Data mining example:


– a classification model for detecting people who are loan risks is built
using poor data:
• Some credit-worthy candidates are denied loans
• More loans are given to individuals who default
What is Data Preprocessing?
Major Tasks
• Data Cleaning
– Handle missing data, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data Integration
– Integration of multiple databases, data cubes, or files
• Data Transformation & Data Discretization
– Normalization
– Concept hierarchy generation
• Data Reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
Data Quality Measures
• Accuracy: correct or wrong, accurate or not

• Completeness: not recorded, unavailable

• Consistency: some records modified but others not; dangling references

• Timeliness: are the data updated in a timely manner?

• Believability: how much are the data trusted to be correct?

• Interpretability: how easily can the data be understood?


Data Cleaning

Data is Dirty
• Real-world data are dirty: often incorrect (faulty instruments, human/computer error, transmission errors), incomplete, noisy, or inconsistent.
• Incomplete
– Lacking attribute values or attributes of interest, or containing only aggregate data
– e.g., Occupation = “” (missing data)
• Noisy
– Containing noise, errors, or outliers
– e.g., Salary = -10 (an error)
• Inconsistent
– Containing discrepancies in codes or names
– e.g., Age = 42 but Birthday = “03/07/2010”
– Rating was “1, 2, 3”, now rating is “A, B, C”
– Discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
– Jan 1 as everyone’s birthday?
Data Cleaning (Data Cleansing)
• Attempt to fill in missing values
• Smooth out noise while identifying outliers
• Correct inconsistencies in the data
Data Cleaning: Incomplete (Missing) Values
• Data is not always available:
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Missing data may be due to:
– Equipment malfunction
– Inconsistent with other recorded data and thus deleted
– Data were not entered due to misunderstanding
– Certain data may not be considered important at the time of entry
– Did not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple:
– Usually done when the class label is missing
– Not effective when the percentage of missing values per attribute varies considerably
– Does not make use of the remaining attributes’ values in the tuple
• Fill in the missing value manually: tedious + infeasible for a large dataset with
many missing values.
• Fill it in automatically with:
– A global constant, e.g., “unknown” – not foolproof
– A measure of central tendency (mean or median)
– The attribute mean or median for all samples belonging to the same class as the given tuple: smarter
– The most probable value: may be determined with regression, a Bayesian formula, or decision tree induction.
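As a minimal illustration of the automatic fills above (assuming pandas and NumPy are available; the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'income' has missing values.
df = pd.DataFrame({
    "class":  ["low", "low", "high", "high"],
    "income": [30_000, np.nan, 90_000, np.nan],
})

# Global constant (not foolproof): use a sentinel such as "unknown"/-1.
df["income_const"] = df["income"].fillna(-1)

# Measure of central tendency: fill with the overall median.
df["income_median"] = df["income"].fillna(df["income"].median())

# Smarter: the median of samples in the same class as the given tuple.
df["income_by_class"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.median())
)
print(df)
```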
Noisy Data
• What is noise? Random error or variance in a measured variable; a modification of the original values.
• Examples:
– Distortion of a person’s voice over a poor phone connection, or “snow” on a television screen.
How to Handle Noisy Data?
• Binning: to smooth a sorted data value by consulting the values
around it.
– First sort data and partition into (equal-frequency) bins
– Then smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Regression
– Smooth by fitting the data into regression functions
• Outlier analysis: clustering
– Detect and remove outliers
• Semi-supervised: combined computer and human inspection
– The computer detects suspicious values, which are then checked by a human
Data Integration

Data Integration
• Combining data from multiple sources into a coherent data store.
• Why data integration?
– Help reduce/avoid noise
– Get more complete picture
– Improve mining speed and quality
• Challenges:
– semantic heterogeneity and
– structure of data
Data Integration Problems
• Entity identification problem: matching schema and objects from different sources.
– e.g., A.cust_id ≡ B.cust_no
– Integrate metadata from different sources
• Redundancy and correlation analysis: spotting correlated numerical and nominal data
• Tuple duplication
• Data value conflict detection and resolution
Handling Redundancy in Data Integration
• Redundant data occur often when integrating multiple databases.
– Object identification: the same attribute or object may have different names in different databases
– Derivable data: one attribute may be derived from another attribute or table, e.g., annual revenue
• What’s the problem?
– If X1 and X2 are (near-)duplicates of the same attribute X, then Y = 2X can equally be written as Y = X1 + X2, Y = 3X1 − X2, or even Y = −1291X1 + 1293X2: the coefficients become arbitrary and the model unstable.
• Redundant attributes may be detected by correlation analysis and covariance analysis (a small sketch follows).
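A small sketch of such an analysis for numeric attributes, assuming pandas and NumPy (the attributes here are synthetic, with x2 a near-duplicate of x1):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.01, size=200),  # redundant copy of x1
    "x3": rng.normal(size=200),                        # unrelated attribute
})

# Pearson correlation: |r| close to 1 flags redundant numeric attributes.
print(df.corr().round(2))

# Covariance gives the same signal on the original (unscaled) attributes.
print(df.cov().round(2))
```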
Data Transformation

Data Transformation
• A function that maps the entire set of values of a given attribute to a new
set of replacement values.
• Methods:
– Smoothing: remove noise from data
– Attribute/feature construction: new attributes constructed from the given ones.
– Aggregation: summarization, data cube construction
– Normalization: scaled to fall within a smaller, specified range
• Min-max normalization
• Z-score normalization
• Normalization by decimal scaling
– Discretization: concept hierarchy climbing
Normalization
• Attribute data are scaled so as to fall within a smaller range, such as
-1 to +1, or 0 to 1.
• Normalizing the data attempts to give all attributes an equal weight.
• Example case:
– Classification using a neural-network backpropagation algorithm: normalizing the input values helps speed up the learning phase.
– Distance-based methods: prevents attributes with large ranges from outweighing attributes with smaller ranges.
– Useful when there is no prior knowledge of the data.
Min-max normalization
• Maps a value v of attribute A from [minA, maxA] to [new_minA, new_maxA]:

  v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

• Example: Let income range from $12,000 to $98,000, normalized to [0, 1]
– Then $73,600 is mapped to

  (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
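A small sketch of the income example above, written in Python for concreteness:

```python
# Min-max normalization of the slide's income example to [0, 1].
min_a, max_a = 12_000, 98_000
new_min, new_max = 0.0, 1.0

def min_max(v):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73_600), 3))  # 0.716
```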
Z-score normalization
• Zero-mean normalization: the distance between the raw value and the mean, measured in units of the standard deviation:

  v' = (v − μA) / σA

• Example: Let μ = 54,000 and σ = 16,000.
– Then $73,600 is transformed to:

  (73,600 − 54,000) / 16,000 = 1.225
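The same example as Python arithmetic:

```python
# Z-score normalization of the slide's example (mu = 54,000, sigma = 16,000).
mu, sigma = 54_000, 16_000
print(round((73_600 - mu) / sigma, 3))  # 1.225
```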
Normalization by decimal scaling
• Normalizes by moving the decimal point of the values of an attribute:

  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

• Example: Let the recorded values range from -986 to 917
– Then dividing by 10^3 normalizes -986 to -0.986 and 917 to 0.917
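A small sketch that derives j from the data and reproduces the example:

```python
import math

# Decimal scaling: divide by 10**j, the smallest power of ten that brings
# every |v'| below 1.
values = [-986, 917]
j = math.floor(math.log10(max(abs(v) for v in values))) + 1
print(j, [v / 10**j for v in values])  # 3 [-0.986, 0.917]
```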
Discretization
• Replaces the raw values of a numeric attribute by interval labels or
conceptual labels.
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs Unsupervised
– Split/Cut (top-down) vs Merge (bottom-up)
– Discretization can be performed recursively on an attribute.
Data Discretization Methods
• Binning
– Top-down split, unsupervised
• Histogram analysis
– Top-down split, unsupervised
• Clustering analysis
– Top-down split or bottom-up merge, unsupervised
• Decision-tree analysis
– Top-down split, supervised
• Correlation analysis
– Bottom-up merge, unsupervised

Note: all methods can be applied recursively


Simple Discretization: Binning
• Top-down splitting technique based on a specified number of bins.
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid.
– If A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B-A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing approx. same number of
samples
– Good data scaling
– Managing categorical attributes can be tricky
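For illustration, pandas exposes both partitioning schemes directly (pd.cut for equal-width, pd.qcut for equal-depth); the price list below is the one used in the smoothing example that follows:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals, each of width W = (34 - 4) / 3 = 10.
print(pd.cut(prices, bins=3).value_counts().sort_index())

# Equal-depth: 3 intervals, each holding roughly the same number of values.
print(pd.qcut(prices, q=3).value_counts().sort_index())
```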
Example Binning for Data Smoothing
Sorted data for price (in $): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equal-depth) bins:


- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

* Smoothing by bin means:


- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
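A NumPy sketch that reproduces the smoothing above (bin means 9, 22.75 ≈ 23, and 29.25 ≈ 29):

```python
import numpy as np

# The slide's sorted prices, split into 3 equal-depth bins of 4 values each.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(prices, 3)

# Smoothing by bin means: replace every value by its (rounded) bin mean.
smoothed = np.concatenate([np.full(len(b), int(round(b.mean()))) for b in bins])
print(smoothed)  # [ 9  9  9  9 23 23 23 23 29 29 29 29]
```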
Unsupervised Discretization:
Binning vs Clustering

[Figure: the same data partitioned by equal-width (distance) binning, equal-depth (frequency) binning, and k-means clustering; k-means clustering leads to better partitions.]

Concept Hierarchy
• A concept hierarchy organizes concepts (i.e., attribute values) hierarchically.
• Concept hierarchy formation: recursively reduce the data by
collecting and replacing low level concepts by higher level concepts.
• Can be automatically formed for both numeric and nominal data
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the analysis
of the number of distinct values per attribute in the dataset.
– Attribute with the most distinct values is placed at the lowest level of the
hierarchy.
– Exceptions exist, e.g., the time attributes weekday, month, quarter, year, where the distinct-value count does not match the conceptual level.

country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
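Counting distinct values per attribute is straightforward with pandas; a toy sketch with a hypothetical location table:

```python
import pandas as pd

# Hypothetical location data; real tables would have far more rows.
df = pd.DataFrame({
    "country": ["ID", "ID", "SG"],
    "province_or_state": ["East Java", "Bali", "Singapore"],
    "city": ["Surabaya", "Denpasar", "Singapore"],
    "street": ["Jl. Raya Kalirungkut", "Jl. Sunset Road", "Orchard Road"],
})

# Fewer distinct values -> higher in the hierarchy:
# country > province_or_state > city > street.
print(df.nunique().sort_values())
```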


Data Compression
• Data reduction techniques that transform the input data to a reduced
representation that is much smaller in volume, yet closely maintains the
integrity of the original data.
• Data compression types:
– Lossless: the original data can be reconstructed from the compressed data
without any information loss
– Lossy: reconstruct only an approximation of the original data.
• String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless, but only limited manipulation is possible without expansion.
• Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be constructed without reconstructing
the whole.
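A tiny lossless string-compression example using Python's standard zlib module:

```python
import zlib

text = b"data mining data mining data mining data cleaning" * 10
compressed = zlib.compress(text, level=9)

# Lossless: the original byte string is recovered exactly.
assert zlib.decompress(compressed) == text
print(len(text), "->", len(compressed), "bytes")
```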
Sampling
• Sampling: obtaining a small sample s to represent the whole dataset
N.

• Key principle: choose a representative subset of the data


– Simple random sampling may have very poor performance in the
presence of skew
– Develop adaptive sampling methods: e.g., stratified sampling
Types of Sampling
• Simple random sampling: equal probability of selecting any particular item.
– SRSWOR (simple random sampling without replacement): once an object is selected, it is removed from the population
– SRSWR (simple random sampling with replacement): a selected object is not removed from the population
• Stratified sampling:
– Partition (or cluster) the dataset, and draw samples from each partition (proportionally, i.e., approx. the same percentage of the data)
[Figure: raw data and a stratified sample.]
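A sketch of simple random vs. stratified sampling with pandas (the 'segment' attribute and its skew are hypothetical):

```python
import pandas as pd

# Hypothetical dataset with a skewed 'segment' attribute.
df = pd.DataFrame({
    "segment": ["A"] * 80 + ["B"] * 15 + ["C"] * 5,
    "value":   range(100),
})

# SRSWOR: simple random sampling without replacement (20% of the rows).
srswor = df.sample(frac=0.20, replace=False, random_state=1)

# SRSWR: simple random sampling with replacement.
srswr = df.sample(frac=0.20, replace=True, random_state=1)

# Stratified sampling: ~20% from each segment, preserving the proportions.
stratified = df.groupby("segment", group_keys=False).sample(frac=0.20, random_state=1)
print(stratified["segment"].value_counts())
```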
Data Reduction

Dimensionality Reduction
• The process of reducing the number of attributes or features under
consideration.

• Dimensionality reduction methodologies:


– Feature selection: find a subset of the original attributes
– Feature extraction: transform data in high-dimensional space to a space of
fewer dimensions.

• Some typical dimensionality reduction methods:


– Principal Component Analysis
– Attribute Subset Selection
– Nonlinear Dimensionality Reduction Methods
Why dimensionality reduction?
• Curse of dimensionality
– When dimensionality increases, data become increasingly sparse.
– The number of possible subspace combinations grows exponentially.

• Advantages of dimensionality reduction:


– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization.
Attribute Subset Selection
• Reduces the dataset size by removing irrelevant or redundant attributes.
– Focus on the relevant dimensions.

• Benefit:
– It reduces the number of attributes appearing in the discovered patterns, helping to
make the patterns easier to understand.

• Redundant attributes:
– Duplicate much or all of the information contained in one or more other attributes.
– Example: purchase price of a product and the amount of sales tax paid

• Irrelevant attributes:
– Contain no information that is useful for the data mining task.
– Example: a student’s ID is often irrelevant to the task of predicting his/her GPA
Attribute Subset Selection (Greedy)
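The slide's body is a figure that is not reproduced here. As a rough sketch of one common greedy strategy, stepwise forward selection, assuming scikit-learn is available and using a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0

# Stepwise forward selection: greedily add the attribute that improves
# cross-validated accuracy the most; stop when no attribute helps.
while remaining:
    scores = {
        f: cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected attributes:", selected, "cv accuracy:", round(best_score, 3))
```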
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important
information in a dataset more effectively than the original ones.
• Three general methodologies:
– Attribute extraction: domain-specific
– Mapping data to new space
• e.g., Fourier transformation, wavelet transformation, manifold approaches
– Attribute construction
• Combining features
• Data discretization
Principal Component Analysis (PCA)
• PCA: a statistical procedure that uses an
orthogonal transformation to convert a set
of observations of possibly correlated
variables into a set of values of linearly
uncorrelated variables called principal
components.
• The original data are projected onto a
much smaller space, resulting in
dimensionality reduction.
• Method:
– Find the eigenvectors of the covariance
matrix,
– And these eigenvectors define the new
space.
[Figure: a ball travels in a straight line; the data recorded from three cameras contain much redundancy.]
PCA Method
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that best represent the data.
– Normalize the input data so each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., the principal components
– Each input data vector is a linear combination of the k principal component vectors
– The principal components are sorted in order of decreasing significance or strength
– Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (the strongest principal components suffice to reconstruct a good approximation of the original data)
• Works for numeric data only.
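A short sketch with scikit-learn (an assumption; the slide does not prescribe a library) on synthetic correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 200 samples of 5 attributes that really live in a 2-dimensional subspace.
X = (rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
     + rng.normal(scale=0.1, size=(200, 5)))

# Normalize the input so every attribute falls within a comparable range.
X_std = StandardScaler().fit_transform(X)

# Keep the k = 2 strongest principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)  # variance captured, decreasing order
print(X_reduced.shape)                # (200, 2)
```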
Exercises

Homework
• Do all the exercises.
• You can write the solutions on paper, or use tools like Excel or Python and explain your work step by step until you reach the solution.
• Create a PDF file of your solutions and submit it to ULS.
• You may also upload the file you used for the computation (.xlsx or .ipynb) along with your .pdf file. Upload those files separately.
• Note: do not forget to put your Student ID and name on the first page of the PDF file.
Exercise 1: Normalization
A recorded dataset of ages D = {13, 15, 16, 16, 19, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70} is used for analysis.

• Use min-max normalization to transform the values into the range [0, 1].
• Use z-score normalization to transform the values.
• Use normalization by decimal scaling to transform the values.
• Compare the normalization methods above and comment on which you would prefer for the given data, and why.
Questions?
