Week 2:
Data Quality
Data is Dirty
• Real-world data are dirty: they often contain potentially incorrect values (faulty instruments, human/computer error, transmission error) and are incomplete, noisy, and inconsistent.
• Incomplete
– Lacking attribute values or attributes of interest, or containing only aggregate data
– e.g., Occupation = “” (missing data)
• Noisy
– Containing noise, errors, or outliers
– e.g., Salary = -10 (an error)
• Inconsistent
– Containing discrepancies in codes or names
– e.g., Age = “42” but Birthday = “03/07/2010”
– Rating was recorded as “1, 2, 3”, now as “A, B, C”
– Discrepancies between duplicate records
• Intentional (e.g., disguised missing data)
– Jan 1 as everyone’s birthday?
Data Cleaning (Data Cleansing)
• Attempt to fill in missing values
• Smooth out noise while identifying outliers
• Correct inconsistencies in the data
Data Cleaning
Incomplete (missing) values
• Data is not always available:
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Missing data may be due to:
– Equipment malfunction
– Inconsistent with other recorded data and thus deleted
– Data were not entered due to misunderstanding
– Certain data may not be considered important at the time of entry
– Did not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple:
– Usually done when the class label is missing
– Not effective when the percentage of missing values per attribute varies considerably
– Does not make use of the remaining attribute values in the tuple
• Fill in the missing value manually: tedious + infeasible for a large dataset with
many missing values.
• Fill it in automatically with one of the following (a short sketch follows this list):
– A global constant, e.g., “unknown” – not foolproof
– A measure of central tendency (mean or median)
– The attribute mean or median for all samples belonging to the same class as the given
tuple: smarter
– The most probable value: may be determined with regression, Bayesian inference, or decision tree induction.
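A minimal sketch of these fill-in strategies using pandas; the DataFrame, its column names, and the “unknown” constant are illustrative assumptions, not data from the slides:

```python
import pandas as pd

# illustrative data with missing values
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],
    "income": [52_000, None, 47_000, 61_000],
    "class": ["A", "A", "B", None],
})

# Ignore the tuple: drop rows whose class label is missing
df = df.dropna(subset=["class"])

# Fill with a global constant
df["occupation"] = df["occupation"].fillna("unknown")

# Fill with a measure of central tendency (mean of the attribute)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the mean of all samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
```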
Noisy Data
• What is noise? Random error or variance in a measured variable; a modification of the original values.
• Examples:
– Distortion of a person’s voice when talking on a poor phone line, or “snow” on a television screen.
How to Handle Noisy Data?
• Binning: smooths a sorted data value by consulting the values around it (see the sketch after this list).
– First sort the data and partition it into (equal-frequency) bins
– Then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
– Smooth by fitting the data into regression functions
• Outlier analysis: clustering
– Detect and remove outliers
• Semi-supervised: combined computer and human inspection
– Detect suspicious values automatically and have a human check them
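A minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the sample values and the NumPy-based helper code are illustrative assumptions:

```python
import numpy as np

# illustrative sorted values (not taken from the slides)
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# partition into equal-frequency bins (3 values per bin here)
bins = np.array_split(data, 3)

# smoothing by bin means: every value in a bin becomes the bin mean
by_means = [np.full(len(b), b.mean()) for b in bins]

# smoothing by bin boundaries: every value becomes the nearest bin boundary
by_boundaries = [np.where(b - b.min() < b.max() - b, b.min(), b.max())
                 for b in bins]

print(by_means)        # bin means 9, 22, 29 repeated within each bin
print(by_boundaries)   # e.g., first bin [4, 8, 15] -> [4, 4, 15]
```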
Data Integration
• Combining data from multiple sources into a coherent data store.
• Why data integration?
– Help reduce/avoid noise
– Get more complete picture
– Improve mining speed and quality
• Challenges:
– Semantic heterogeneity
– Differences in the structure of the data
Data Integration Problems
• Entity identification problem: matching schemas and objects from different sources.
– e.g., deciding whether A.cust_id and B.cust_no refer to the same attribute
– Integrate metadata from different sources
• Redundancy and correlation analysis: spotting correlated numerical and nominal data
• Tuple duplication
• Data value conflict detection and resolution
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases.
– Object identification: the same attribute or object may have different names in different databases
– Derivable data: one attribute may be derivable from another attribute or table, e.g., annual revenue
• What’s the problem?
– If X1 and X2 are redundant copies of the same attribute X, the single relationship Y = 2X can be written equally well as Y = X1 + X2, Y = 3X1 − X2, or Y = −1291X1 + 1293X2: the fitted coefficients become ambiguous and unstable.
• Redundant attributes may be detected by correlation analysis and
covariance analysis.
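A minimal sketch of correlation and covariance analysis for spotting redundant attributes; the attribute names and values are illustrative assumptions (the chi-square test shown for the nominal attributes is the usual companion technique, not something spelled out on this slide):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# numeric attributes: correlation / covariance analysis
df = pd.DataFrame({
    "price": [100.0, 150.0, 200.0, 250.0],
    "tax":   [10.0, 15.0, 20.0, 25.0],   # derivable from price
    "age":   [23, 41, 35, 52],
})
print(df.corr())   # price and tax correlate perfectly -> redundant
print(df.cov())    # covariance matrix of the numeric attributes

# nominal attributes: chi-square test of independence on a contingency table
contingency = pd.crosstab(
    pd.Series(["M", "M", "F", "F", "F", "M"], name="gender"),
    pd.Series(["yes", "no", "yes", "no", "yes", "no"], name="bought"),
)
chi2, p, dof, expected = chi2_contingency(contingency)
print(chi2, p)     # a small p-value suggests the attributes are correlated
```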
Data Transformation
• A function that maps the entire set of values of a given attribute to a new
set of replacement values.
• Methods:
– Smoothing: remove noise from data
– Attribute/feature construction: new attributes constructed from the given ones.
– Aggregation: summarization, data cube construction
– Normalization: scaled to fall within a smaller, specified range
• Min-max normalization
• Z-score normalization
• Normalization by decimal scaling
– Discretization: concept hierarchy climbing
Normalization
• Attribute data are scaled so as to fall within a smaller range, such as
-1 to +1, or 0 to 1.
• Normalizing the data attempts to give all attributes an equal weight.
• Example case:
– Classification using a neural network (backpropagation): normalizing input values helps speed up the learning phase.
– Distance-based methods: prevents attributes with large ranges from outweighing attributes with smaller ranges.
– Useful when there is no prior knowledge of the data.
Min-max normalization
• Maps the original range [minA, maxA] to a new range [new_minA, new_maxA]:
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
• Example: Let income range $12,000 to $98,000 be normalized to [0, 1]
– Then $73,600 is mapped to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
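A minimal sketch of min-max normalization reproducing the income example above (function and parameter names are illustrative):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# income range $12,000..$98,000 normalized to [0, 1]
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```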
Z-score normalization
• Zero-mean normalization: the value is expressed as the distance between the raw value and the mean, in units of the standard deviation:
v' = (v − meanA) / std_devA
• Example: with a mean income of $54,000 and a standard deviation of $16,000, $73,600 is mapped to
(73,600 − 54,000) / 16,000 = 1.225
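The same example as a short sketch (illustrative function name):

```python
def z_score_normalize(v, mean_a, std_a):
    """Express v as its distance from the mean in standard-deviation units."""
    return (v - mean_a) / std_a

# mean income $54,000, standard deviation $16,000
print(z_score_normalize(73_600, 54_000, 16_000))  # 1.225
```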
Normalization by decimal scaling
• Normalizes by moving the decimal point of the values of an attribute:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
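A minimal sketch of decimal-scaling normalization (the values and function name are illustrative):

```python
def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([-986, 120, 917]))  # j = 3 -> [-0.986, 0.12, 0.917]
```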
Dimensionality Reduction
• The process of reducing the number of attributes or features under
consideration.
• Benefit:
– It reduces the number of attributes appearing in the discovered patterns, helping to
make the patterns easier to understand.
• Redundant attributes:
– Duplicate much or all of the information contained in one or more other attributes.
– Example: purchase price of a product and the amount of sales tax paid
• Irrelevant attributes:
– Contain no information that is useful for the data mining task.
– Example: a student’s ID is often irrelevant to the task of predicting his/her GPA
Attribute Subset Selection (Greedy)
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important
information in a dataset more effectively than the original ones.
• Three general methodologies:
– Attribute extraction: domain-specific
– Mapping data to new space
• e.g., Fourier transformation, wavelet transformation, manifold approaches
– Attribute construction
• Combining features
• Data discretization
Principal Component Analysis (PCA)
• PCA: a statistical procedure that uses an
orthogonal transformation to convert a set
of observations of possibly correlated
variables into a set of values of linearly
uncorrelated variables called principal
components.
• The original data are projected onto a
much smaller space, resulting in
dimensionality reduction.
• Method:
– Find the eigenvectors of the covariance
matrix,
– And these eigenvectors define the new
space.
[Figure: a ball travels in a straight line; the data recorded by three cameras contain much redundancy.]
PCA Method
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that best represent the data.
– Normalize input data: each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input vector is expressed as a linear combination of the k principal component vectors.
– The principal components are sorted in order of decreasing significance or strength.
– Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using only the strongest principal components, a good approximation of the original data can be reconstructed).
• Works for numeric data only.
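A minimal sketch of the PCA steps above using the eigenvectors of the covariance matrix; the toy data are illustrative assumptions, and a library routine such as scikit-learn's PCA would give an equivalent result:

```python
import numpy as np

# toy data: 5 observations of 3 correlated numeric attributes (illustrative)
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 1.0],
              [3.1, 3.0, 1.6]])

# 1. normalize the input data (here: center each attribute)
Xc = X - X.mean(axis=0)

# 2. the eigenvectors of the covariance matrix define the new space
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. sort the components by decreasing variance (significance)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. keep only the k strongest components and project the data
k = 1
Z = Xc @ eigvecs[:, :k]                             # reduced data (5 x k)
X_approx = Z @ eigvecs[:, :k].T + X.mean(axis=0)    # approximate reconstruction
print(eigvals)                                      # variance along each component
```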
Exercises
Homework
• Do all the exercises.
• You can write the solution on paper, or you can use tools such as Excel or Python; explain your work in detail, step by step, up to the final solution.
• Create a PDF file of your solution and submit it to ULS.
• You may also upload the file you used for the computation (.xlsx or .ipynb) along with your .pdf file. Upload those files separately.
• Note: do not forget to put your Student ID and name on the first page of the PDF file.
Exercise 1: Normalization
The recorded age data D = {13, 15, 16, 16, 19, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70} is used for analysis.