Topic 05 - Data Preprocessing
Data Preprocessing
Le Ngoc Thanh
[email protected]
Department of Computer Science
Data
◎ Attribute (Key) - Value
◎ Data types
○ numeric, categorical
○ static, dynamic (time)
◎ Other data types
○ Distributed data
○ Text data
○ Web data, metadata
○ Pictures, audio / video
○ ....
Data quality
◎ Missing, incomplete: missing attribute values, missing attributes of interest, or containing only aggregate data
○ Example: age, weight = “ ”
◎ Noise: contains errors or outliers
○ Example: salary = “-100 000”
◎ Conflict: inconsistencies in codes or names
○ Example: age = 42, birth = 03/07/1997; US = USA?
Consequences of data quality
◎ Correct decisions must be based on accurate data
○ For example, duplicated or missing data can produce inaccurate, or even misleading, statistics.
◎ Data warehouse needs consistent integration of quality
data
Solutions? (1/2)
Solutions? (2/2)
◎ Data Cleaning
○ Fill in missing values, smooth out noisy data, identify and remove outliers, and resolve inconsistent data
◎ Data Integration
○ Combine and integrate data from many different databases and files
◎ Data Transformation
○ Aggregation
◎ Data Reduction
○ Reduce the data size while preserving analytical results
Contents
◎ Why do we need to prepare data?
◎ Data cleaning
◎ Data integration
◎ Data reduction
◎ Data transformation
Data cleaning
◎ Data cleaning is the most important task
◎ Data cleaning is the process:
○ Fill in the missing values
○ Identify and eliminate noise data
○ Resolve conflicting data
Fill in missing values (1/2)
◎ Delete records with missing values:
○ Commonly used when the class label is missing (in classification)
○ Simple, but not effective, especially when the ratio of missing values is high
◎ Fill in missing values manually: tedious and often infeasible
◎ Fill in missing values automatically:
○ Replace with a global constant, e.g. “unknown”; this may form a new class in the data
Fill in missing values (2/2)
◎ Fill in missing values automatically:
○ Replace with the attribute's mean (see the sketch below)
○ Replace with the attribute's mean over samples of the same class
○ Replace with the most probable value: inferred with a Bayesian formula, a decision tree, or the EM (Expectation Maximization) algorithm
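A minimal sketch of the first two automatic strategies, assuming a hypothetical pandas DataFrame whose age column has gaps and whose label column holds the class:

```python
import numpy as np
import pandas as pd

# Hypothetical records: 'age' has missing values, 'label' is the class.
df = pd.DataFrame({
    "age":   [25, np.nan, 47, 51, np.nan, 33],
    "label": ["A", "A", "B", "B", "A", "B"],
})

# Replace with the attribute's overall mean.
df["age_mean"] = df["age"].fillna(df["age"].mean())

# Replace with the attribute's mean within the same class.
df["age_class_mean"] = df["age"].fillna(
    df.groupby("label")["age"].transform("mean")
)
print(df)
```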
Data cleaning
◎ Data cleaning is the most important task
◎ Data cleaning is the process:
○ Fill in the missing values
○ Identify and eliminate noise data
○ Resolve conflicting data
Noise reduction
◎ The basic methods of noise reduction:
○ Binning method:
◉ Sort the data and divide it into equal-width or equal-depth bins
◉ Smooth by bin means, medians, boundaries, …
○ Clustering method:
◉ Detect and remove outliers
○ Regression method:
◉ Fit data into the regression function
Noise reduction – Binning (1/4)
◎ Binning method
○ Divide data into equal-width bins:
◉ Divide the range of values into N intervals of approximately equal size
◉ Width of each interval = (maximum value − minimum value) / N
○ Divide data into equal-depth bins:
◉ Divide the range of values into N intervals, each containing approximately the same number of samples (see the sketch below)
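A small numpy sketch of both partitioning schemes, run here on the temperature values used in the examples that follow:

```python
import numpy as np

values = np.sort(np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]))

# Equal-width: N intervals, each (max - min) / N wide.
N = 7
edges = np.linspace(values.min(), values.max(), N + 1)  # 64, 67, ..., 85
counts, _ = np.histogram(values, bins=edges)
print(counts)  # [2 2 4 2 0 2 2] -- one bin stays empty

# Equal-depth: bins of ceil(len / N) consecutive sorted samples each.
N = 4
k = -(-len(values) // N)  # ceiling division
depth_bins = [values[i:i + k] for i in range(0, len(values), k)]
print([len(b) for b in depth_bins])  # [4, 4, 4, 2]
```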
Noise reduction – Binning (2/4)
◎ Example of equal-width binning:
Temperature values with N = 7 (bin width = (85 − 64) / 7 = 3):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
[Histogram: counts per bin are 2, 2, 4, 2, 0, 2, 2]
[Histogram: salary in the company, equal-width bins from [0 – 200,000) to [1,800,000 – 2,000,000]]
Noise reduction – Binning (4/4)
◎ Example of equal-depth binning:
Temperature values with N = 4:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
[Histogram: counts per bin are 4, 4, 4, 2]
Noise reduction with binning
◎ Data split into three bins:
○ Bin 1: 4, 8, 15
○ Bin 2: 21, 21, 24
○ Bin 3: 25, 28, 34
◎ Smoothing by medians (see the sketch below):
○ Bin 1: 8, 8, 8
○ Bin 2: 21, 21, 21
○ Bin 3: 28, 28, 28
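One way to implement the three smoothing rules on the bins above; the boundary rule snaps each value to the nearer bin edge:

```python
import numpy as np

bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

for b in bins:
    b = np.array(b)
    by_mean   = np.full_like(b, int(round(b.mean())))  # every value -> bin mean
    by_median = np.full_like(b, int(np.median(b)))     # every value -> bin median
    lo, hi = b.min(), b.max()
    by_boundary = np.where(b - lo <= hi - b, lo, hi)   # snap to nearer boundary
    print(by_mean, by_median, by_boundary)
# Median smoothing reproduces the slide: [8 8 8], [21 21 21], [28 28 28]
```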
Exercises
◎ Prices:
15, 17, 19, 25, 29, 31, 33, 41, 42, 45, 45, 47, 52, 52, 64
◎ Use the binning method with equal-width and equal-depth partitioning into four bins:
○ Compute the bin values after median smoothing.
○ Compute the bin values after boundary smoothing.
○ Compute the bin values after mean smoothing.
○ Comment on the results.
Noise reduction?
◎ The basic methods of noise reduction:
○ Binning method:
◉ Sort the data and divide it into equal-width or equal-depth bins
◉ Smooth by bin means, medians, boundaries, …
○ Clustering method:
◉ Detect and remove outliers
○ Regression method:
◉ Fit data into the regression function
Noise reduction – clustering
Noise reduction – regression
[Figure: data points (x, y) fitted by the regression line y = x + 1]
Data cleaning
◎ Data cleaning is the most important task
◎ Data cleaning is the process:
○ Fill in the missing values
○ Identify and eliminate noise data
○ Resolve conflicting data
Resolve conflicts
◎ How to handle conflicting data?
◎ Give examples of each conflict resolution method.
Contents
◎ Why do we need to prepare data?
◎ Data cleaning
◎ Data integration
◎ Data reduction
◎ Data transformation
Data Integration
◎ Select and aggregate data from many different sources
into one database
◎ What problems occur when selecting and aggregating
data?
Data integration process (1/4)
◎ Process:
○ Select only required data for the data mining process.
○ Match the data schemas
○ Eliminate redundant and duplicate data
○ Detect and resolve data inconsistencies
Data integration process (2/4)
◎ Schema Matching
○ Entity recognition problem
◉ How can equivalent entities from multiple data sources be matched?
◉ US=USA; customer_id = cust_number
○ Metadata
Data integration process (3/4)
◎ Eliminate redundant and duplicated data
○ An attribute is redundant if it can be derived from other attributes
○ The same attribute may have different names in different databases
○ Some records in the data are duplicated
○ Use correlation analysis (Pearson coefficient r; see the sketch below)
◉ r = 0: X and Y are not correlated
◉ r > 0: positive correlation; X↑ ⇒ Y↑
◉ r < 0: negative correlation; X↑ ⇒ Y↓
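A minimal sketch of the redundancy check using numpy's built-in Pearson coefficient; the attribute values are made up for illustration:

```python
import numpy as np

# Hypothetical attributes pulled from two integrated sources.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(f"r = {r:.3f}")
if abs(r) > 0.9:
    # Strongly correlated: one attribute is largely redundant,
    # so it is a candidate for elimination.
    print("consider dropping one of the two attributes")
```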
Data integration process (4/4)
◎ Resolve inconsistencies in data
○ For example, weight is measured in kilograms or pounds
○ Define standards and mapping based on metadata
Contents
◎ Why do we need to prepare data?
◎ Data cleaning
◎ Data integration
◎ Data reduction
◎ Data transformation
Data reduction
◎ The data may be too large for some data mining applications, making analysis time-consuming.
◎ Data reduction is the process of reducing data (size) so that
the same (or almost the same) analysis result is obtained.
Methods of data reduction
◎ Methods:
○ Aggregation
○ Dimensionality reduction
○ Data compression
○ Numerosity reduction
○ Discretization and Concept hierarchies
Data reduction – Aggregation (1/3)
◎ Aggregation
○ Combine two or more attributes (or objects) into one attribute (or object)
◉ Example: cities aggregated into regions, regions into countries, …
○ Aggregate low-level data into high-level data:
◉ Decreases data set size: fewer attributes
◉ Increases the interestingness of patterns
Data reduction – Aggregation (2/3)
Data reduction – Aggregation (3/3)
Data reduction – Dimensionality reduction (1/6)
◎ Dimensionality reduction
○ Feature selection (subset of attributes)
◉ Choose m from n attributes
◉ Remove irrelevant, redundant attributes
○ How to identify irrelevant attributes?
◉ Statistics
◉ Information gain
Data reduction – Dimensionality reduction (2/6)
◎ How to reduce the data dimension?
○ Brute force
◉ There are 2^d attribute subsets of d attributes
◉ The computational complexity is too high
○ Heuristic methods
◉ Stepwise forward selection
◉ Stepwise backward elimination
◉ Combine two methods
◉ Inductive decision tree
Data reduction – Dimensionality reduction (3/6)
◎ Heuristic - Stepwise forward
○ Step 1: choose the best single attribute
○ Step 2: choose the best attribute from the rest, … (see the sketch after the example)
◎ Example with initial attribute set:
{A1,A2,A3,A4,A5,A6}
○ Result ={}
◉ S1: Result = {A1}
◉ S2: Result = {A1,A4}
◉ S3: Result = {A1,A4,A6}
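A generic sketch of this greedy loop, assuming a hypothetical caller-supplied score(subset) function (e.g. the cross-validated accuracy of a model trained on that subset):

```python
def forward_selection(attributes, score, k):
    """Stepwise forward selection: greedily grow the result set
    by the attribute whose addition scores best."""
    result, remaining = [], list(attributes)
    while remaining and len(result) < k:
        best = max(remaining, key=lambda a: score(result + [a]))
        result.append(best)
        remaining.remove(best)
    return result

# Toy usage: a score that simply favours A1, A4, A6, mirroring the slide.
good = {"A1", "A4", "A6"}
print(forward_selection(
    ["A1", "A2", "A3", "A4", "A5", "A6"],
    score=lambda s: sum(a in good for a in s),
    k=3,
))  # ['A1', 'A4', 'A6']
```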
Data reduction – Dimensionality reduction (4/6)
◎ Heuristic - Stepwise backward
○ Step 1: remove the worst single attribute
○ Step 2: continue removing the worst of the remaining attributes, …
◎ Example with initial attribute set:
{A1,A2,A3,A4,A5,A6}
○ Result ={A1,A2,A3,A4,A5,A6}
◉ S1: Result = {A1,A3,A4,A5,A6}
◉ S2: Result = {A1,A4,A5,A6}
◉ S3: Result = {A1,A4, A6}
Data reduction – Dimensionality reduction (5/6)
◎ Heuristic – Combine Forward and Backward
○ Step 1: select the best single attribute and remove the worst single attribute
○ Step 2: continue selecting the best and removing the worst among the remaining attributes, …
◎ Example with initial attribute set: {A1,A2,A3,A4,A5,A6}
○ Result = {A1,A2,A3,A4,A5,A6}
◉ S1: Result = {A1,A3,A4,A5,A6}
◉ S2: Result = {A1,A4,A5,A6}
◉ S3: Result = {A1,A4, A6}
Data reduction – Dimensionality reduction (6/6)
◎ Heuristic – Inductive decision tree
○ Step 1: build a decision tree
○ Step 2: remove all attributes that do not appear in the tree
◎ Example with initial attribute set:
{A1,A2,A3,A4,A5,A6}
⇒ Result = {A1, A4, A6}
[Figure: decision tree with internal nodes A4?, A1?, A6?]
Data reduction – Numerosity reduction
◎ Numerosity reduction: choose an alternative, smaller representation of the data
◎ Some methods:
○ Parametric methods:
◉ Fit a mathematical model and store only its parameters
◉ Regression and log-linear models
○ Non-parametric methods:
◉ Do not assume a model; store a reduced representation instead
◉ Histograms, clustering, sampling
Data reduction – Numerosity reduction
◎ Linear regression: Y = a + b X (see the sketch below)
◎ Multiple linear regression: Y = b0 + b1 X1 + b2 X2
◎ Log-linear model:
○ Probability: p(a, b, c, d) = α_ab · β_ac · γ_ad · δ_bcd
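A sketch of the parametric idea, assuming hypothetical data: fit Y = a + bX by least squares and keep only the two parameters instead of the raw points:

```python
import numpy as np

# Hypothetical data to be replaced by two stored parameters.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

b, a = np.polyfit(x, y, deg=1)  # coefficients: slope first, then intercept
print(f"store only a = {a:.2f}, b = {b:.2f}")
y_approx = a + b * x            # reconstruct approximate values on demand
```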
Data reduction – Numerosity reduction
◎ Histogram
○ A common method for data reduction
○ Divide the data into bins; the height of each column is the number of objects in that bin. Store only the average of each bin (see the sketch below)
○ The shape of the histogram depends on the number of bins
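A sketch of histogram-based reduction on the price list from the earlier exercise, keeping only bin edges, counts, and one stored mean per non-empty bin:

```python
import numpy as np

prices = np.array([15, 17, 19, 25, 29, 31, 33, 41, 42, 45, 45, 47, 52, 52, 64])

counts, edges = np.histogram(prices, bins=5)  # equal-width bins
idx = np.digitize(prices, edges[1:-1])        # bin index of each value
means = [prices[idx == i].mean() for i in range(len(counts)) if counts[i] > 0]

# The reduced representation: edges + one stored value per non-empty bin.
print(edges, counts, means)
```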
Data reduction – Numerosity reduction
◎ Clustering
○ Divide the data into groups and store only a representative for each group (sketched below)
○ Very effective when the data is naturally clustered, ineffective when it is scattered
○ Many clustering algorithms exist
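A sketch with scikit-learn's KMeans (the slide names no algorithm; this is one common choice), storing one centroid per group:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data set; keep k centroids as its representatives.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(data)
representatives = kmeans.cluster_centers_  # 8 points stand in for 1000
print(representatives.shape)               # (8, 2)
```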
Data reduction – Numerosity reduction
◎ Sampling (see the sketch below)
○ Use a much smaller random sample instead of the large data set
○ Simple random sample without replacement (SRSWOR)
○ Simple random sample with replacement (SRSWR)
○ Cluster / stratified sampling
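Both simple random sampling schemes, sketched with the Python standard library:

```python
import random

data = list(range(100))  # hypothetical record ids
n = 10

srswor = random.sample(data, n)    # without replacement: no repeats
srswr = random.choices(data, k=n)  # with replacement: repeats possible
print(srswor, srswr, sep="\n")
```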
Data reduction – Numerosity reduction
[Figure: SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement) drawn from the raw data]
Data reduction – Discretization and Concept hierarchies
◎ Discretization:
○ Convert a (continuous) attribute's value domain by dividing it into intervals
○ Store interval labels instead of actual values
○ Suitable for continuous numeric data
○ Methods: binning, histogram analysis, clustering, entropy-based discretization, natural partitioning
Data reduction – Discretization and Concept hierarchies
◎ Concept hierarchies:
○ Collect and replace low-level concepts with higher-level concepts
○ Suitable for non-numeric data: build a concept hierarchy
Data reduction – Discretization and Concept hierarchies
◎ Example:
○ Convert logical (boolean) values to 0/1
○ Convert a date value to a number
○ Convert columns with large numeric values into a smaller range of values, for example by dividing them by a constant factor
○ Group values with the same semantics: activity before the August Revolution is group 1; from 01/08/45 to 31/06/54 is group 2; from 01/07/54 to 30/4/75 is group 3, …
○ Replace age values with young, middle-aged, old (see the sketch below)
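A sketch of the last mapping; the age cut-offs are illustrative assumptions, not fixed by the slide:

```python
def age_group(age: int) -> str:
    """Replace a numeric age with a higher-level concept label."""
    if age < 30:   # assumed cut-off
        return "young"
    if age < 60:   # assumed cut-off
        return "middle-aged"
    return "old"

print([age_group(a) for a in (13, 42, 70)])  # ['young', 'middle-aged', 'old']
```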
Contents
◎ Why do we need to prepare data?
◎ Data cleaning
◎ Data integration
◎ Data reduction
◎ Data transformation
Data transformation
◎ Data transformation: convert data into a form that is suitable and convenient for mining algorithms
◎ Data transformation processes:
○ Smoothing
○ Aggregation
○ Generalization
○ Normalization
○ Attribute construction
Data transformation process
◎ Smoothing: remove noise from the data.
◎ Aggregation: summarize or aggregate the data.
◎ Generalization: replace low-level concepts with high-level concepts.
◎ Normalization: scale attribute values into a small range, such as 0 to 1 (see the sketch below).
◎ Attribute construction: construct new attributes and add them to the given attribute set.
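For normalization, a minimal min-max sketch that rescales an attribute linearly into [0, 1]:

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Map attribute values linearly onto [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

salary = np.array([30_000, 45_000, 60_000, 120_000], dtype=float)
print(min_max_normalize(salary))  # approximately [0. 0.167 0.333 1.]
```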
Conclusion
◎ Data is often missing, noisy, inconsistent, and
multidimensional.
◎ Good data is the key to creating reliable and valid models.
◎ Data preparation includes the following processes:
○ Cleaning
○ Selection
○ Reduction
○ Transformation
Exercises
◎ Why is preparing data so essential and time-consuming?
◎ How to solve the problem of missing values in database
records?
◎ Assume the database has an Age attribute with the following (ascending) values in its records:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70
○ Smooth the data by bin means with n = 4 bins. Explain the effectiveness of this technique on the above data.
○ Plot the equal-width histogram with bin width = 10
Exercises
◎ Why do we need to select / integrate data? Describe the data selection process.
◎ Why do we need data reduction? Can the data reduction process lose information? If so, state how to remedy it.
◎ Study the data transformation processes. Give an example for each.