ISE233 Lecture 3

The document discusses preprocessing operational data from industrial systems. It covers topics like data cleaning, integration, reduction, and transformation which includes cleaning noisy and missing data, reducing dimensions, and transforming data through normalization, discretization, and other techniques.

Uploaded by

sindhura2258


Operational Data Analysis for Industrial Systems
(Lecture 3)
Synopsis
• Data Preprocessing
✓ Data cleaning
✓ Data integration
✓ Data reduction
✓ Data transformation
Data Preprocessing
• Real-world data are highly susceptible to noise, missing values, and inconsistencies.
• Preprocessing aims to improve data quality.
• Data quality – accuracy, completeness, consistency, timeliness, believability, and interpretability
Major Tasks in Data Preprocessing
1. Data cleaning – filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies
2. Data integration – combining data from multiple sources for analysis
3. Data reduction – reducing the complexity of the data
• Dimensionality reduction – obtain a reduced representation of the data (fewer attributes)
• Numerosity reduction – replace the data with a parametric model (regression or log-linear models) or a nonparametric representation (histograms, clusters, sampling)
4. Data transformation – normalization, discretization, concept hierarchy generation
Data Cleaning
Missing Values

1. Ignore the data point – usually done when the class label is missing (generally not effective)
2. Fill in the missing value manually – time consuming
3. Use a global constant to fill in the missing value – e.g., "Unknown" or −∞
4. Use a measure of central tendency for the attribute – mean or median
5. Use the attribute mean or median of all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
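Strategies 3 to 5 above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `income` values, and the `risk` class labels are all made up for the example, and missing entries are assumed to be encoded as `None`.

```python
from collections import defaultdict
from statistics import mean, median

def fill_with_constant(values, constant="Unknown"):
    """Strategy 3: replace every missing entry with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_central_tendency(values, measure=mean):
    """Strategy 4: replace missing entries with the attribute mean or median."""
    observed = [v for v in values if v is not None]
    fill = measure(observed)
    return [fill if v is None else v for v in values]

def fill_with_class_central_tendency(values, labels, measure=mean):
    """Strategy 5: use the mean or median of samples in the same class.

    Assumes every class has at least one observed value.
    """
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not None:
            by_class[c].append(v)
    fills = {c: measure(vs) for c, vs in by_class.items()}
    return [fills[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical attribute with two missing entries and a class label per sample.
income = [30, None, 50, 70, None, 90]
risk = ["low", "low", "low", "high", "high", "high"]

constant_filled = fill_with_constant(income)
mean_filled = fill_with_central_tendency(income)
median_filled = fill_with_central_tendency(income, median)
class_filled = fill_with_class_central_tendency(income, risk)
print(constant_filled, mean_filled, median_filled, class_filled, sep="\n")
```

The class-conditional variant fills the two gaps with different values (the "low" and "high" class means), whereas the global variant uses one value for both.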
Data Cleaning
Noisy Data – noise is a random error or variance in a measured variable

• Binning: binning methods smooth sorted data values by consulting their "neighborhood", that is, the values around them
1. Smoothing by bin means
2. Smoothing by bin medians
3. Smoothing by bin boundaries
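The three smoothing variants can be sketched as follows. The sketch assumes equal-frequency bins of a fixed size and resolves ties in boundary smoothing toward the lower boundary; both choices, the function names, and the `prices` values are illustrative assumptions rather than something the slides specify.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Illustrative data only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins, by_means, by_medians, by_boundaries, sep="\n")
```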
Data Cleaning
Noisy Data - Noise is a random error or variance in a measured variable

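• Regression – smooth the data by fitting the values to a function
• Outlier analysis – outliers may be detected by clustering; values that fall outside the clusters may be considered outliers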
Data Integration
Redundancy and Correlation Analysis

• Redundancy – an attribute may be redundant if it can be derived from another attribute or a set of attributes
• Some redundancy can be detected by correlation analysis
✓ χ² correlation test for nominal data
✓ Correlation coefficient for numeric data
✓ Covariance of numeric data
Data Integration
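χ² Correlation Test for Nominal Data

• Suppose A has c distinct values, $a_1, a_2, \ldots, a_c$; B has r distinct values, $b_1, b_2, \ldots, b_r$

$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$$

• $o_{ij}$ is the observed frequency of the joint event $(A = a_i, B = b_j)$
• $e_{ij}$ is the expected frequency
• n is the number of data points
• degrees of freedom = (r − 1)(c − 1)

Contingency table layout: one row per value of A ($a_1, a_2, \ldots, a_c$) and one column per value of B ($b_1, b_2, \ldots, b_r$), with a Total column and a Total row; cell (i, j) holds $o_{ij}$.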
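A minimal sketch of the χ² statistic follows, assuming the contingency table is given as a list of rows (one per value of A) with one column per value of B. The function name and the counts are illustrative; the code only computes the statistic and the degrees of freedom, and leaves the comparison against a critical value to a statistical table or library.

```python
def chi_square(observed):
    """Return (chi2, degrees_of_freedom) for a table of observed counts o_ij."""
    row_totals = [sum(row) for row in observed]        # count(A = a_i)
    col_totals = [sum(col) for col in zip(*observed)]  # count(B = b_j)
    n = sum(row_totals)                                # number of data points

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n   # expected frequency
            chi2 += (o_ij - e_ij) ** 2 / e_ij

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

# Illustrative 2 x 2 table: rows are the values of A, columns the values of B.
table = [
    [250, 200],
    [50, 1000],
]
chi2, dof = chi_square(table)
print(f"chi2 = {chi2:.2f}, degrees of freedom = {dof}")
```

A large χ² value relative to the critical value for the given degrees of freedom suggests that A and B are correlated, so one of them may be redundant.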
Data Integration
Correlation Coefficient for Numeric Data

• Correlation coefficient (Pearson's product moment coefficient)

$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{n \sigma_A \sigma_B}$$

• n is the number of data points
• $a_i$ and $b_i$ are the values of A and B
• $\bar{A}$ and $\bar{B}$ are the mean values of A and B
• $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B
• $r_{A,B}$ lies in the range [-1, +1]
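Both forms of the coefficient can be checked against each other numerically. The sketch below assumes population standard deviations (dividing by n), which is what the n in the denominator implies; the function names and sample values are illustrative.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def pop_std(xs):
    """Population standard deviation (divides by n, not n - 1)."""
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def pearson_r(a, b):
    """First form: sum of products of deviations over n * sigma_A * sigma_B."""
    n = len(a)
    a_bar, b_bar = mean(a), mean(b)
    numerator = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return numerator / (n * pop_std(a) * pop_std(b))

def pearson_r_shortcut(a, b):
    """Second form: (sum of a_i * b_i) - n * mean(A) * mean(B) in the numerator."""
    n = len(a)
    numerator = sum(ai * bi for ai, bi in zip(a, b)) - n * mean(a) * mean(b)
    return numerator / (n * pop_std(a) * pop_std(b))

# Illustrative data: b rises with a, c falls exactly linearly with a.
a = [1, 2, 3, 4, 5]
b = [2, 4, 5, 4, 6]
c = [10, 8, 6, 4, 2]

r_ab = pearson_r(a, b)
r_ab_shortcut = pearson_r_shortcut(a, b)
r_ac = pearson_r(a, c)
print(r_ab, r_ab_shortcut, r_ac)
```

A value near +1 or -1 indicates a strong linear relationship, which is the signal used to flag a possibly redundant attribute.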
Data Integration
Covariance of Numeric Data


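$$E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}$$

$$E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}$$

$$Cov(A, B) = E((A - \bar{A})(B - \bar{B})) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}$$

$$r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$$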
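The covariance and its link to the correlation coefficient can be sketched as below. The two price series are illustrative values, the function names are assumptions, and the standard deviations are population values (dividing by n) to match the definitions above.

```python
from math import sqrt

def expectation(xs):
    """E(X): the arithmetic mean of the observed values."""
    return sum(xs) / len(xs)

def covariance(a, b):
    """Cov(A, B) = mean of the products of deviations from the means."""
    a_bar, b_bar = expectation(a), expectation(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / len(a)

def pop_std(xs):
    """Population standard deviation; equal to sqrt(Cov(X, X))."""
    return sqrt(covariance(xs, xs))

# Illustrative data: prices of two stocks observed at five time points.
stock_a = [6, 5, 4, 3, 2]
stock_b = [20, 10, 14, 5, 5]

cov_ab = covariance(stock_a, stock_b)
r_ab = cov_ab / (pop_std(stock_a) * pop_std(stock_b))
print(f"E(A) = {expectation(stock_a)}, E(B) = {expectation(stock_b)}")
print(f"Cov(A, B) = {cov_ab:.2f}, r = {r_ab:.3f}")
```

A positive covariance means the two attributes tend to rise together; dividing by the two standard deviations rescales it into the unit-free coefficient in [-1, +1].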
Data Reduction
Principal Components Analysis (PCA)
Data Reduction
Attribute Subset Selection
Data Reduction
• Regression and Log-linear Models: Parametric Data Reduction
• Histograms
• Clustering
• Sampling
Data Transformation
Data Transformation by Normalization
• Min-max normalization – a linear transformation. Suppose that $min_A$ and $max_A$ are the minimum and maximum values of an attribute A. Min-max normalization maps a value $v_i$ of A to $v_i'$ in the range $[new\_min_A, new\_max_A]$ by computing

$$v_i' = \frac{v_i - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A$$

• Z-score normalization – values are normalized using the mean $\bar{A}$ and standard deviation $\sigma_A$ of A

$$v_i' = \frac{v_i - \bar{A}}{\sigma_A}$$

• Decimal scaling normalization – normalization by moving the decimal point of the values of attribute A

$$v_i' = \frac{v_i}{10^j}$$

where j is the smallest integer such that $\max(|v_i'|) < 1$
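The three normalizations can be sketched as small Python functions. The function names and the numeric inputs are illustrative assumptions; the mean and standard deviation for the z-score are passed in explicitly rather than estimated from data.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v linearly from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer so that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Illustrative numbers only.
mm = min_max(73600, min_a=12000, max_a=98000)
zs = z_score(73600, mean_a=54000, std_a=16000)
scaled, j = decimal_scaling([-986, 917])
print(mm, zs, scaled, j)
```

Min-max normalization needs the attribute range up front and is sensitive to outliers that stretch that range; z-score normalization is the usual alternative when the minimum and maximum are unknown or dominated by outliers.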
Data Transformation
Data Transformation by Discretization – the raw values of a numeric attribute are replaced by interval labels or conceptual labels

• Discretization by binning
• Discretization by histogram analysis
• Discretization by cluster analysis
• Discretization by decision tree analysis
• Discretization by correlation analysis
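As one concrete case, discretization by binning can be sketched with equal-width intervals. Equal-width binning is only one of several possible schemes, and the function name, the `ages` values, and the choice of three bins are assumptions made for illustration.

```python
def equal_width_discretize(values, k):
    """Replace each numeric value with the label of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form intervals")
    width = (hi - lo) / k

    labels = []
    for v in values:
        # The maximum value would land in bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        left = lo + idx * width
        right = left + width
        closing = "]" if idx == k - 1 else ")"
        labels.append(f"[{left:g}, {right:g}{closing}")
    return labels

# Illustrative attribute values.
ages = [5, 12, 18, 25, 33, 41, 58, 65]
labels = equal_width_discretize(ages, 3)
for age, label in zip(ages, labels):
    print(age, "->", label)
```

The interval labels could then be mapped to conceptual labels (for example youth, adult, senior) if a concept hierarchy is available.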
