
Data Preprocessing:

Concepts and Techniques

Lecture 02
Data Preprocessing

An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Quality: Why Preprocess the Data?

A multi-dimensional view of data quality:

• Accuracy: correct or incorrect, precise or imprecise
• Completeness: missing records, unavailable values, ...
• Consistency: some values modified while related ones are not, discrepancies, ...
• Timeliness: is the data updated in a timely manner?
• Believability: how much can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing

Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Cleaning
• Real Data: flawed in various ways, e.g., faulty instruments, human or
computer error, transmission errors
• Incomplete Data: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• Noisy Data: containing noise, errors, or outliers
• e.g., Age=“−10” (an error)
• Inconsistent Data: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional Data (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Missing Data
• Data are not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
Weather Monitoring System:
Date        Temperature (°C)  Humidity (%)  Wind Speed (km/h)
2023-07-01  25.0              60            15
2023-07-02  26.5              58            10
2023-07-03  N/A               55            20
2023-07-04  N/A               62            12
2023-07-05  27.0              59            18

Explanation: On July 3rd and 4th, the temperature sensor malfunctioned, resulting in no temperature data being recorded for those days.
Missing Data
Missing data may be due to
• inconsistent with other recorded data and thus deleted: Sales Database

Transaction ID  Customer ID  Product ID  Quantity  Price   Total
001             1001         2001        2         $50.00  $100.00
002             1002         2002        -50       $30.00  N/A
003             1003         2003        3         $20.00  $60.00
004             1004         2004        1         $15.00  $15.00

Explanation: Transaction 002 was flagged and its total value was deleted because
it showed an impossible purchase quantity of -50 units, which is inconsistent with
the other recorded data.

• data not entered due to misunderstanding


Hospital Patient Records:
Patient ID  Age  Gender  Smoking Status  Diagnosis
001         50   Female  N/A             Diabetes
002         37   Male    Non-smoker      Asthma

Explanation: The "Smoking Status" field is missing for several patients because
a nurse misunderstood the form and left this field blank for those entries.
Missing Data
Missing data may be due to
• certain data may not be considered important at the time of entry
Customer Database:
Customer ID Name Email Occupation
1 Alice [email protected] N/A
2 Bob [email protected] N/A
3 Charlie [email protected] Engineer
4 Daisy [email protected] Teacher

Explanation: The company initially did not record customers' occupation, considering it
unimportant. Later, when analyzing customer demographics, this data was found missing for
initial entries.
• history or changes of the data are not registered

Product Inventory System:
Date        Product ID  Product Name  Price
2023-07-15  3001        Widget A      $12
2023-08-01  3001        Widget A      $15
2023-08-15  3002        Widget B      $20
Explanation: The price of Widget A is updated regularly, but past prices are not stored. If
analysis requires historical pricing data, only the most recent price is available, and previous
prices are missing.
• In such cases, it may be necessary to infer missing data.
How to Handle Missing Data?

Student Grades Data:
Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   78          B
3           21   N/A         N/A
4           22   88          A
5           23   90          A

Ignore the tuple: usually done when the class label is missing (when doing
classification). Not effective when the % of missing values per attribute
varies considerably.

Student ID 3 has no grade in the data given, so we can ignore (drop) that row:

Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   78          B
4           22   88          A
5           23   90          A
How to Handle Missing Data?
Student Grades Data:
Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   N/A         B
3           N/A  78          C
4           22   88          A
5           23   90          A

Fill it in automatically with a global constant, e.g., "unknown" (effectively a new class?!):

Student ID  Age      Exam Score  Grade
1           20       85          A
2           21       unknown     B
3           unknown  78          C
4           22       88          A
5           23       90          A
How to Handle Missing Data?
• Student Grades Data:
Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   N/A         B
3           N/A  78          C
4           22   88          A
5           23   90          A

• Fill it in automatically with the attribute mean:
Age mean = (20 + 21 + 22 + 23) / 4 = 21.5
Exam Score mean = (85 + 78 + 88 + 90) / 4 = 85.25

Student ID  Age   Exam Score  Grade
1           20    85          A
2           21    85.25       B
3           21.5  78          C
4           22    88          A
5           23    90          A
How to Handle Missing Data?
Student Grades Data:
Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   N/A         B
3           N/A  78          C
4           22   88          A
5           23   90          A

Fill it in automatically with the attribute mean for all samples belonging to the same class (smarter):

Student ID  Age  Exam Score  Grade
1           20   85          A
2           21   85          B
3           22   78          C
4           22   88          A
5           23   90          A
How to Handle Missing Data?

Fill it in automatically with the most probable value: inference-based, such as a
Bayesian formula or a decision tree.
(This will be explained with examples and hands-on work in the classification topic.)

Fill in the missing value manually: this is tedious and often infeasible.
Numerical

Fill in Missing Values: Given the dataset:

Age Salary
25 5000
30 ?
35 8000
? 10000
27 5500
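A minimal Python sketch of one way to do this exercise, assuming pandas is available and filling each gap with its column mean (other strategies from the previous slides, such as a global constant, work the same way):

```python
import pandas as pd

# Dataset from the exercise; None marks the missing entries ("?")
df = pd.DataFrame({
    "Age":    [25, 30, 35, None, 27],
    "Salary": [5000, None, 8000, 10000, 5500],
})

# Replace every missing value with the mean of its own column
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)
```

With this data the fills would be Age = (25 + 30 + 35 + 27) / 4 = 29.25 and Salary = (5000 + 8000 + 10000 + 5500) / 4 = 7125.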
Noisy Data
• Noise: random error or variance in a measured variable
- A temperature sensor records 100°C in one reading while nearby sensors
show around 22°C. The 100°C reading is likely random noise.
• Incorrect attribute values may be due to
• faulty data collection instruments
- A humidity sensor reads 120% (which is impossible) due to a
malfunction.
• data entry problems
- An income field incorrectly shows -$5,000 due to a typing error.
• data transmission problems
- A transaction record shows missing data (e.g., Amount is N/A) because
of a transmission error.
• inconsistency in naming convention
- "john doe" vs. "John Doe" in user records, leading to inconsistent
naming.
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers)
Numerical

Smooth Noisy Data: Given the dataset:

Temperature
20
22
21
100
23
20
25
19

Apply a moving average filter with a window size of 3 to smooth the data.
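A small sketch of this smoothing in Python, assuming pandas; a centered window of size 3 is used, so the first and last values have no complete window:

```python
import pandas as pd

temps = pd.Series([20, 22, 21, 100, 23, 20, 25, 19])

# Centered moving average with window size 3
smoothed = temps.rolling(window=3, center=True).mean()
print(smoothed)
```

Note that the 100 °C spike still inflates the averages of its neighbours; a rolling median (temps.rolling(3, center=True).median()) is more robust to such outliers.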
Data Cleaning - IQR (Interquartile Range) method

The IQR method is used to identify and remove outliers from a dataset.
It is based on the range within which the middle 50% of data values lie.
Outliers are values that fall outside this range.

Quartiles divide a dataset into four equal parts. Q1 (First Quartile) is the 25th percentile,
and Q3 (Third Quartile) is the 75th percentile. Here’s how to calculate them:
Steps to Calculate Q1 and Q3:
Sort the Data: Arrange your data in ascending order.
Determine the Position of Q1 and Q3:
Q1 Position: Position of Q1=(N+1)/4
Q3 Position: Position of Q3=3×(N+1)/4
Where N is the number of data points.

Find Q1 and Q3:


If the position is an integer: The quartile value is the data point at that position.
If the position is not an integer: Interpolate between the two closest data points.
Data Cleaning - IQR (Interquartile Range) method
Calculate the IQR:
Find the first quartile (Q1) and the third quartile (Q3).
Compute the IQR as IQR=Q3−Q1

Determine the bounds:


Calculate the lower bound as Lower Bound=Q1−1.5×IQR
Calculate the upper bound as Upper Bound=Q3+1.5×IQR
Identify outliers:
Any data point outside the lower and upper bounds is considered an
outlier.
Remove the outlier from the data set.
Numerical
Identify or Remove Outliers: Given the dataset:
Scores
80
86
79
150
88
90
84
92
25
Identify the outlier using the IQR (Interquartile Range) method and remove it.
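A Python sketch of this exercise that follows the (N+1)-position quartile rule described above (library defaults such as numpy.percentile use a slightly different interpolation, so their quartiles can differ a little):

```python
def quartile(sorted_values, fraction):
    """Quartile using the (N+1)-position rule with linear interpolation."""
    pos = fraction * (len(sorted_values) + 1)   # 1-based position
    lower = int(pos)
    frac = pos - lower
    if frac == 0:
        return sorted_values[lower - 1]
    # interpolate between the two closest data points
    return sorted_values[lower - 1] + frac * (sorted_values[lower] - sorted_values[lower - 1])

scores = sorted([80, 86, 79, 150, 88, 90, 84, 92, 25])
q1 = quartile(scores, 0.25)                  # position 2.5 -> 79.5
q3 = quartile(scores, 0.75)                  # position 7.5 -> 91.0
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 62.25 and 108.25
outliers = [x for x in scores if x < low or x > high]
cleaned = [x for x in scores if low <= x <= high]
print(outliers)   # [25, 150]
```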
Data Integration

• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are
different
• Possible reasons: different representations, different scales, e.g., metric vs.
British units
Example- Integration of Multiple Databases
Database 1:
ID  Age
1   34
2   45
3   27
4   40
5   19

Database 2:
ID  Name
1   John
2   Alice
3   Bob
4   Rahul
5   Harry

Integrating the databases to form a single dataset:
ID  Name   Age
1   John   34
2   Alice  45
3   Bob    27
4   Rahul  40
5   Harry  19
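A minimal sketch of this integration with pandas, assuming both sources share the ID key as in the example:

```python
import pandas as pd

db1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5], "Age": [34, 45, 27, 40, 19]})
db2 = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "Name": ["John", "Alice", "Bob", "Rahul", "Harry"]})

# Join on the shared key; schema integration means recognising that
# both sources use "ID" for the same entity
merged = pd.merge(db2, db1, on="ID", how="inner")
print(merged)   # ID, Name, Age for all five customers
```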
Handling Redundancy in Data Integration

Redundant data occur often when integration of multiple databases


Object identification: The same attribute or object may have different
names in different databases
Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue
Redundant attributes can often be detected by correlation analysis
and covariance analysis (a small sketch follows this slide)
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
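A hedged sketch of correlation-based redundancy detection on a made-up table; the column names here are hypothetical, chosen to mimic the "derived attribute" case above:

```python
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10, 12, 9, 15, 11],
    "annual_revenue":  [120, 144, 108, 180, 132],   # 12 x monthly -> derivable, redundant
    "employees":       [5, 8, 4, 9, 7],
})

# Pearson correlation matrix; pairs with |r| close to 1 are redundancy candidates
corr = df.corr()
for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) > 0.95:      # threshold is a judgement call
            print(f"{a} and {b} look redundant (r = {corr.loc[a, b]:.2f})")
```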
Data Reduction Strategies

Data reduction:
Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical
results

Why data reduction?


A database/data warehouse may store terabytes of data. Complex data
analysis may take a very long time to run on the complete data set.
Data Reduction Strategies
Data reduction strategies
Dimensionality reduction
Wavelet transforms
Example: Compressing an image by converting it into wavelet coefficients and
keeping only the important ones.
Principal Components Analysis (PCA)
Example: Reducing a dataset of 100 features to just 10 principal components (see the sketch after this list).
Feature subset selection, feature creation
Example: Using only the height and weight from a health dataset, ignoring other
less important features.

Numerosity reduction


Regression and Log-Linear Models
Example: Using a simple line to predict house prices based on size.
Histograms, clustering, sampling
Example: Creating a bar chart to show the frequency of different age groups in a
survey.
Data cube aggregation
Example: Summarizing sales data by week instead of daily.

Data compression
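A minimal PCA sketch for the dimensionality-reduction item above, assuming scikit-learn; the data here is random, so the retained variance is low, whereas real, correlated features would keep much more:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))      # toy data: 200 samples, 100 features

pca = PCA(n_components=10)           # keep the 10 leading principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (200, 10)
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained
```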
Data Transformation
A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Min-max normalization

Min-Max Normalization rescales feature values to a specified range, usually


[0, 1].
Sometimes it rescales feature values to [-1, 1].

It is used to ensure that all features contribute equally to the analysis by


standardizing their range.
Normalized Value = (X − min(X)) / (max(X) − min(X))

Where:
X is the original value.
min(X) is the minimum value of the feature.
max(X) is the maximum value of the feature.

Person  Age
A       23
B       45
C       30
D       50
E       40
Example

Person  Age
A       23
B       45
C       30
D       50
E       40

Minimum Age (Min) = 23
Maximum Age (Max) = 50

Apply Min-Max Normalization to each value:
Normalized Age = (Age − 23) / (50 − 23)

Person A: Normalized Age = (23 − 23) / (50 − 23) = 0/27 = 0
Person B: Normalized Age = (45 − 23) / (50 − 23) = 22/27 ≈ 0.81
Person C: Normalized Age = (30 − 23) / (50 − 23) = 7/27 ≈ 0.26
Person D: Normalized Age = (50 − 23) / (50 − 23) = 27/27 = 1
Person E: Normalized Age = (40 − 23) / (50 − 23) = 17/27 ≈ 0.63

Person  Age  Normalized Age
A       23   0.00
B       45   0.81
C       30   0.26
D       50   1.00
E       40   0.63
Numerical

Normalization: Given the dataset:


Value
100
400
200
500
Normalize the values to a range of [0, 1].
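A quick check of this exercise in plain Python, applying the min-max formula from the previous slides:

```python
values = [100, 400, 200, 500]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)   # [0.0, 0.75, 0.25, 1.0]
```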
Data Discretization Methods
Typical methods (all the methods can be applied recursively):

Binning (top-down split, unsupervised)

Binning is a data preprocessing technique that transforms numerical variables
into categorical ones by dividing the range of the variable into bins.
It is typically top-down: the range of the variable is divided into intervals
(bins) in an unsupervised manner.

Customer ID  Age
1            23
2            45
3            31
4            52
5            37

Customer ID  Age  Age Group
1            23   20-29
2            45   40-49
3            31   30-39
4            52   50-59
5            37   30-39
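A short sketch of this binning with pandas; pd.cut gives the equal-width decade bins used above, while pd.qcut would give equal-frequency bins instead:

```python
import pandas as pd

customers = pd.DataFrame({"Customer ID": [1, 2, 3, 4, 5],
                          "Age": [23, 45, 31, 52, 37]})

# Decade-style, equal-width bins: [20, 30), [30, 40), [40, 50), [50, 60)
bins = [20, 30, 40, 50, 60]
labels = ["20-29", "30-39", "40-49", "50-59"]
customers["Age Group"] = pd.cut(customers["Age"], bins=bins,
                                labels=labels, right=False)
print(customers)
```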
Data Discretization Methods
Histogram analysis (top-down split, unsupervised)

It involves creating histograms to understand the distribution of data points
within bins.

Approach: This method can be seen as a top-down split where data is divided
into bins based on the value range in an unsupervised manner.

Customer ID  Age
1            23
2            45
3            31
4            52
5            37

Customer ID  Age  Age Group
1            23   20-29
2            45   40-49
3            31   30-39
4            52   50-59
5            37   30-39

Age Group  Frequency
20-29      1
30-39      2
40-49      1
50-59      1
Data Discretization Methods
Clustering analysis (unsupervised, top-down split or bottom-up merge)

Clustering is the task of dividing a set of objects into groups (clusters) so
that objects in the same cluster are more similar to each other than to those
in other clusters.

Approach:
Top-down split: methods like divisive clustering start with all data points in
one cluster and split them recursively.
Bottom-up merge: methods like agglomerative clustering start with each data
point as its own cluster and merge them recursively.

Customer ID  Age  Purchase Amount
1            23   200
2            45   500
3            31   150
4            52   700
5            37   300

Clusters (assuming 2 clusters based on age and purchase amount):
Cluster 1: Customers 1, 3, 5 (younger, lower purchases)
Cluster 2: Customers 2, 4 (older, higher purchases)
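The slide describes divisive (top-down) and agglomerative (bottom-up) clustering; the sketch below uses k-means with scikit-learn as a simple stand-in to group the same five customers, scaling the features first so the purchase amounts do not dominate the distances:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# (Age, Purchase Amount) for customers 1-5
X = [[23, 200], [45, 500], [31, 150], [52, 700], [37, 300]]

X_scaled = MinMaxScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)   # customers 1, 3, 5 fall in one cluster; 2 and 4 in the other
```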
Data Discretization Methods
Decision-tree analysis (supervised, top-down split)

Decision-tree analysis involves using a tree-like model to make decisions based
on the values of input features. It is a supervised learning method.

Approach: The method starts at the root and splits the data recursively into
subsets based on feature values (top-down split).

Customer ID  Age  Purchase Amount  Repeat Purchase
1            23   200              No
2            45   500              No
3            31   150              No
4            52   700              Yes
5            37   300              No

Decision Tree:
Node 1 (Root): Age
  Age < 40: Repeat Purchase = No
  Age ≥ 40: split on Purchase Amount
    Purchase Amount < 600: Repeat Purchase = No
    Purchase Amount ≥ 600: Repeat Purchase = Yes
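A minimal sketch of the same idea with scikit-learn's DecisionTreeClassifier on the five rows above; with so little data the learned split (e.g., a single threshold on Purchase Amount or Age) may differ in detail from the hand-drawn tree:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "Age":            [23, 45, 31, 52, 37],
    "PurchaseAmount": [200, 500, 150, 700, 300],
    "RepeatPurchase": ["No", "No", "No", "Yes", "No"],
})

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(df[["Age", "PurchaseAmount"]], df["RepeatPurchase"])

# Print the learned splits as text
print(export_text(tree, feature_names=["Age", "PurchaseAmount"]))
```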
Data Discretization Methods
Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)

Correlation analysis examines the relationship between two or more


variables. The chi-squared (χ2) test is often used to test the independence
of two categorical variables.

Approach: In a bottom-up merge, variables that have a significant


correlation are grouped together, which can be seen in methods like
hierarchical clustering that use correlation measures to merge clusters.
Correlation (e.g., χ2) analysis
Step 1: Create a Contingency Table

A contingency table is used to display the frequency distribution of the
variables. It shows how often certain combinations of categories occur.

1. Identify the variables: in this case, the variables are "Age Group" and
"Repeat Purchase".
2. Organize the data: arrange the data into a table format, showing the counts
of each combination of the variables.

Customer ID  Age  Repeat Purchase
1            23   No
2            45   Yes
3            31   No
4            52   Yes
5            37   No
6            28   No
7            55   Yes
8            49   Yes

Age Group     Repeat Purchase: Yes  Repeat Purchase: No  Row Total
20-29         0                     2                    2
30-39         0                     2                    2
40-49         2                     0                    2
50-59         2                     0                    2
Column Total  4                     4                    8
Correlation (e.g., χ2) analysis
Step 2: Calculate Expected Frequencies

Expected frequencies are calculated to determine what the frequencies would be
if there was no association between the variables.

Formula for expected frequency: Expected frequency = (Row total × Column total) / Grand total

Apply the formula: for each cell in the table, calculate the expected frequency.

Age Group     Repeat Purchase: Yes  Repeat Purchase: No  Row Total
20-29         0                     2                    2
30-39         0                     2                    2
40-49         2                     0                    2
50-59         2                     0                    2
Column Total  4                     4                    8

Age Group  Expected Yes  Expected No
20-29      1             1
30-39      1             1
40-49      1             1
50-59      1             1
Correlation (e.g., χ2) analysis
Step 3: Calculate the Chi-Squared Statistic

The chi-squared statistic measures how much the observed frequencies deviate
from the expected frequencies.

Formula for chi-squared: χ2 = Σ [(Oi − Ei)² / Ei]
where Oi is the observed frequency and Ei is the expected frequency.

Calculate the statistic: compute the chi-squared value for each cell and sum them up.

Age Group  Observed (Yes)  Expected (Yes)  (O−E)²/E (Yes)  Observed (No)  Expected (No)  (O−E)²/E (No)
20-29      0               1               1               2              1              1
30-39      0               1               1               2              1              1
40-49      2               1               1               0              1              1
50-59      2               1               1               0              1              1

Sum the chi-squared values for all cells: χ2 = 1+1+1+1+1+1+1+1 = 8
The Chi Square Distribution
Correlation (e.g., χ2) analysis
Step 4: Calculate Degrees of Freedom

Degrees of freedom are used to determine the critical value from the chi-squared
distribution.

Formula for degrees of freedom: Degrees of freedom=(No of rows−1)×(No of columns−1)


Calculate: Degrees of freedom=(4−1)×(2−1)=3

Step 5: Perform Chi-Squared Test

Compare the calculated chi-squared statistic to the critical value from the chi-squared
distribution table at the desired significance level (e.g., 0.05) with the calculated degrees
of freedom.

Find the critical value: For 3 degrees of freedom at the 0.05 significance level, the
critical value for χ2 is approximately 7.815.

Compare the values: Since the calculated χ2=8 is greater than 7.815, reject the null
hypothesis that the variables are independent.
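The same test can be reproduced with SciPy's chi2_contingency, assuming SciPy is available; it returns the statistic, the p-value, the degrees of freedom, and the expected counts from Step 2:

```python
from scipy.stats import chi2_contingency

# Observed contingency table from Step 1 (rows: age groups, columns: Yes / No)
observed = [
    [0, 2],   # 20-29
    [0, 2],   # 30-39
    [2, 0],   # 40-49
    [2, 0],   # 50-59
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, dof, p_value)   # chi2 = 8.0, dof = 3, p ≈ 0.046 < 0.05 -> reject independence
print(expected)             # every expected count is 1, matching Step 2
```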
Analyse the correlation (e.g., χ2) for the given data

Outcome        Treatment A  Treatment B  Control
Recovered      30           50           20
Not Recovered  20           40           40
Summary
• Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
