Lecture 1

CSE 412: SELECTED TOPICS IN COMPUTER ENGINEERING

Data Cleansing
Reasons for Data Cleansing

• The data to be analyzed may be:
  ◦ Incomplete: some of the data is missing
  ◦ Noisy: the data may contain errors or outlier values
  ◦ Inconsistent: the data may contain discrepancies in the values
How can Data be Cleaned?

• Filling in Missing Values
• Smoothing Noisy Data
• Identifying and Removing Outliers
• Resolving Inconsistency
Typical Example

Sample Table
Typical Example (Incomplete Data)

Sample Table
Typical Example (Data Values Errors)

Sample Table

Reasons:
• The Zip code consists of five digits and cannot contain any letters
• Income must be a positive number
• Age must be a positive number
Typical Example (Outlier Values)

Sample Table

Reasons:
• Outliers are data values that deviate from the expected values of the rest of the data set
• The values 10000000 and -40000 differ markedly from the rest of the values
Typical Example (Ambiguity)

Sample Table

Reasons:
• “S” in Marital Status could refer to “Single” or “Separated”
• The value is therefore ambiguous
Reasons for Incomplete Data

• Relevant data may not be recorded because of:
  ◦ A misunderstanding by the data-entry staff
  ◦ Equipment failure
• Relevant data may not be available because it is unknown or because providing it is optional
Dealing with Incomplete Data

There are several ways to deal with missing data:
• Replace the missing value with some default value
• Replace the missing value with the field mean for fields that take numerical values, or with the mode (if it exists) for fields that take categorical values
• Replace the missing value with a value generated at random from the observed field distribution
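The three strategies above can be sketched in a few lines of Python. This is only an illustration: the function name `fill_missing`, the use of `None` to mark a missing value, and the fallback to the default when no single mode exists are all assumptions, not part of the lecture.

```python
import random
from statistics import mean, multimode


def fill_missing(values, strategy, default=None, seed=None):
    """Return a copy of `values` in which every None is replaced.

    strategy is one of:
      "default" - use the supplied default value
      "mean"    - use the mean of the observed values (numeric fields)
      "mode"    - use the most frequent observed value (categorical fields)
      "random"  - draw a value at random from the observed values
    """
    observed = [v for v in values if v is not None]
    rng = random.Random(seed)

    if strategy == "default":
        pick = lambda: default
    elif strategy == "mean":
        field_mean = mean(observed)
        pick = lambda: field_mean
    elif strategy == "mode":
        candidates = multimode(observed)
        # Assumption: with no single most frequent value, fall back to the default.
        field_mode = candidates[0] if len(candidates) == 1 else default
        pick = lambda: field_mode
    elif strategy == "random":
        # Sampling from the observed values follows their empirical distribution.
        pick = lambda: rng.choice(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return [pick() if v is None else v for v in values]


print(fill_missing([21, None, 22, 12], "default", default=0))  # [21, 0, 22, 12]
print(fill_missing(["A", None, "A", "B"], "mode"))             # ['A', 'A', 'A', 'B']
```

Drawing with `rng.choice` from the observed values means frequent values are picked more often, which is one simple way to follow the observed field distribution.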
Mean, Median, and Mode

• The mean for a population of size n can be computed by:
  Mean = (x1 + x2 + ... + xn)/n
• Consider the following list of 9 numbers:
  13, 15, 12, 17, 22, 11, 13, 19, 12
  Mean = (13 + 15 + 12 + 17 + 22 + 11 + 13 + 19 + 12)/9 = 14.88889


Mean, Median, and Mode

• The median is the middle value of the ordered list of numbers.
• Consider the following list of 9 numbers:
  13, 15, 12, 17, 22, 11, 13, 19, 12
• To compute the median, first order the numbers:
  11, 12, 12, 13, 13, 15, 17, 19, 22
• Hence, the median is 13


Mean, Median, and Mode

• The median is the middle value of the ordered list of numbers.
• Consider the following list of 10 numbers:
  13, 15, 12, 17, 22, 11, 13, 19, 12, 14
• To compute the median, first order the numbers:
  11, 12, 12, 13, 13, 14, 15, 17, 19, 22
• The list has an even number of items, so the median is the average of the two middle values: (13 + 14)/2 = 13.5
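The mean and both median cases (odd and even list length) can be checked with a short Python sketch. The helper names `mean_of` and `median_of` are made up for this illustration; Python's standard `statistics` module offers equivalent functions.

```python
def mean_of(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median_of(values):
    """Middle value of the sorted list; average of the two middle values if even."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


nine = [13, 15, 12, 17, 22, 11, 13, 19, 12]
ten = nine + [14]

print(round(mean_of(nine), 5))  # 14.88889
print(median_of(nine))          # 13
print(median_of(ten))           # 13.5
```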


Mean, Median, and Mode

• The mode of a set of data is the value in the set that occurs most often.
• Consider the following list of numbers:
  13, 15, 12, 17, 22, 11, 13, 19, 13

Number   Occurrence
13       3
15       1
12       1
17       1
22       1
11       1
19       1

The mode is 13
Mean, Median, and Mode

• The mode of a set of data is the value in the set that occurs most often.
• Consider the following list of numbers:
  13, 15, 12, 17, 22, 11, 13, 19, 12

Number   Occurrence
13       2
15       1
12       2
17       1
22       1
11       1
19       1

The modes are 13 and 12 (bimodal)
Mean, Median, and Mode

• The mode of a set of data is the value in the set that occurs most often.
• Consider the following list of numbers:
  13, 15, 12, 17, 22, 11, 19

Number   Occurrence
13       1
15       1
12       1
17       1
22       1
11       1
19       1

There is no mode
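The three mode cases (one mode, two modes, no mode) can be reproduced by counting occurrences. In this sketch the function name `modes` is hypothetical, and the rule "if every distinct value is tied, there is no mode" is an assumption chosen to match the examples.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes, or [] when the data set has no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    winners = sorted(v for v, c in counts.items() if c == top)
    # Assumption: if every distinct value is tied, the data set has no mode.
    if len(winners) == len(counts) and len(counts) > 1:
        return []
    return winners


print(modes([13, 15, 12, 17, 22, 11, 13, 19, 13]))  # [13]
print(modes([13, 15, 12, 17, 22, 11, 13, 19, 12]))  # [12, 13]
print(modes([13, 15, 12, 17, 22, 11, 19]))          # []
```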
Handling Missing Values

A set of fields with missing values


Handling Missing Values (Using Default Values)

Defaults
• Field1 default: 0
• Field2 default: N
• Field3 default: 240
• Field4 default: 50
Handling Missing Values (Using Means and Modes)

• Use the mean for the numeric fields and the mode (if it exists) for the categorical fields
• If the mode doesn’t exist, rely on either a default value or a random value
• Numeric fields: Field1, Field3, and Field4
  ◦ Field1 mean = (21+24+22+12+11+16+16+17+18)/9 = 17.44
  ◦ Field3 mean = 334.44
  ◦ Field4 mean = 81.78
• If a field doesn’t accept decimal values, round the mean value
Handling Missing Values (Using Means and Modes)

• Field2 is categorical, hence we need to compute the mode from the existing values

Category   Occurrence
A          3
B          1
W          2
C          2

• Hence, the mode is A


Handling Missing Values (Using Means and Modes)

Assumptions
• Assume Field1 and Field4 don’t accept decimal numbers; hence we round the mean
• Field3 accepts decimal numbers; hence we don’t round the mean value
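As a rough check of the numbers above, the following Python sketch recomputes the Field1 mean and the Field2 mode from the values and counts given in this example. The variable names are illustrative only, and the Field4 line simply rounds the stated mean of 81.78 because the underlying values are not listed here.

```python
from collections import Counter
from statistics import mean

# Observed (non-missing) Field1 values and Field2 category counts from the example.
field1_observed = [21, 24, 22, 12, 11, 16, 16, 17, 18]
field2_counts = Counter({"A": 3, "B": 1, "W": 2, "C": 2})

field1_mean = mean(field1_observed)
field1_fill = round(field1_mean)                  # Field1 is assumed to take whole numbers
field4_fill = round(81.78)                        # stated Field4 mean, rounded
field2_fill = field2_counts.most_common(1)[0][0]  # most frequent category

print(round(field1_mean, 2))  # 17.44
print(field1_fill)            # 17
print(field4_fill)            # 82
print(field2_fill)            # A
```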
Handling Missing Values (Using Random Values)
Handling Outliers

Sample Table
Data Set Possible Outlier Values

• Outliers are data values that deviate from the expected values of the rest of the data set
• Outliers are extreme values that lie near the limits of the data range or go against the trend of the remaining data
• Outliers normally need further investigation to make sure that they are not caused by mistakes during data entry
Handling Outliers Using Inter-quartile Range

[Figure: the sorted data items on a line, with Q1 marking the lowest 25% of data items, Q2 the lowest 50%, and Q3 the lowest 75%]

• A quartile is any of the three values that divide the sorted data set into four equal parts
• The first quartile (Q1) cuts off the lowest 25% of the data
• The second quartile (Q2) cuts the data set in half (it is the median of the data set)
• The third quartile (Q3) cuts off the highest 25% of the data, or the lowest 75%
Computing Q1, Q2, and Q3

• To compute Q1, Q2, and Q3, use the following method:
  ◦ Sort the given data set in ascending order.
  ◦ Use the median to divide the ordered data set into two halves. This median is the second quartile (Q2). Exclude this median (if it is one of the data items) from any further computation.
  ◦ The first quartile (Q1) is the median of the lower half of the data.
  ◦ The third quartile (Q3) is the median of the upper half of the data.
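The method above can be sketched in Python as follows. The function name `quartiles` is hypothetical, and the sketch assumes at least two data items. Note that other quartile conventions exist (for example, interpolating methods in statistics libraries) and can give different values; this sketch follows the median-of-halves method described here.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data items")
    half = n // 2
    lower = ordered[:half]
    # With an odd count the median is itself a data item, so it is skipped.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]))  # (15, 40, 43)
print(quartiles([39, 36, 7, 40, 41, 17]))                     # (17, 37.5, 40)
```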
Example #1 of Computing Q1, Q2, and Q3

• Compute Q1, Q2, and Q3 for the following data set:
  6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36
• Sort the given data set in ascending order:
  6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49
• Q2 = 40 (median of the data set)
• Q1 is the median of the lower half of the data (6, 7, 15, 36, 39):
  Q1 = 15
• Q3 is the median of the upper half of the data (41, 42, 43, 47, 49):
  Q3 = 43
Example #2 of Computing Q1, Q2, and Q3

• Compute Q1, Q2, and Q3 for the following data set:
  39, 36, 7, 40, 41, 17
• Sort the given data set in ascending order:
  7, 17, 36, 39, 40, 41
• Q2 = (36+39)/2 = 37.5 (median of the data set). The number of data items is even, so the median is the average of the two middle data items
• The median is not a data item, so all items in the first half of the data (7, 17, 36) are used to compute Q1 and the remaining items (39, 40, 41) are used to compute Q3
• Q1 = 17
• Q3 = 40
Detecting Outliers using Inter-quartile Range

• Compute the Inter-Quartile Range (IQR) as follows:
  IQR = Q3 - Q1
• A data value is an outlier if:
  ◦ its value is <= (Q1 - 1.5*IQR), or
  ◦ its value is >= (Q3 + 1.5*IQR).
Example of Detecting Outliers using Inter-quartile Range

Sample Table
Data set that might contain outliers

• Data Set:
  75000, -40000, 10000000, 50000, 99999
Example of Detecting Outliers using Inter-quartile Range

• Data Set:
  75000, -40000, 10000000, 50000, 99999
• Ordered Data Set:
  -40000, 50000, 75000, 99999, 10000000
• Q2 = 75000
• Q1 = (-40000 + 50000)/2 = 5000
• Q3 = (99999 + 10000000)/2 = 5049999.5
• IQR = Q3 - Q1
      = 5049999.5 - 5000
      = 5044999.5
• Q1 - 1.5*IQR = 5000 - 1.5*5044999.5 = -7562499.25
• Q3 + 1.5*IQR = 5049999.5 + 1.5*5044999.5 = 12617498.75
• All values in the data set are within this range, hence there are no outliers in this example
Example of Detecting Outliers using Inter-quartile Range

Data set that might contain outliers

• Data Set:
  75000, 40000, 10000000, 50000, 99999, 75000
Example of Detecting Outliers using Inter-quartile Range

• Data Set:
  75000, 40000, 10000000, 50000, 99999, 75000
• Ordered Data Set:
  40000, 50000, 75000, 75000, 99999, 10000000
• Q2 = (75000 + 75000)/2 = 75000
• Q1 = 50000
• Q3 = 99999
• IQR = Q3 - Q1
      = 99999 - 50000
      = 49999
• Q1 - 1.5*IQR = 50000 - 1.5*49999 = -24998.5
• Q3 + 1.5*IQR = 99999 + 1.5*49999 = 174997.5
• Hence data item 10000000 is an outlier and should be re-investigated for any data-entry errors
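Both outlier examples can be reproduced with a small Python sketch. The function names `iqr_limits` and `find_outliers` are made up for this illustration; the quartiles use the median-of-halves method from the earlier slides, and the comparison uses <= and >= as in the rule stated for this lecture.

```python
from statistics import median


def iqr_limits(data):
    """Return (lower limit, upper limit) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    # With an odd count the overall median is excluded from the upper half.
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data):
    """Return the values at or beyond either limit."""
    low, high = iqr_limits(data)
    return [x for x in data if x <= low or x >= high]


five = [75000, -40000, 10000000, 50000, 99999]
six = [75000, 40000, 10000000, 50000, 99999, 75000]

print(iqr_limits(five))     # (-7562499.25, 12617498.75)
print(find_outliers(five))  # []
print(iqr_limits(six))      # (-24998.5, 174997.5)
print(find_outliers(six))   # [10000000]
```

With only five items, the extreme value 10000000 itself enters the Q3 computation and widens the limits, which is why it is not flagged in the first data set.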
Noisy Data

• Noisy data are data that contain incorrect values
• Some reasons for noisy data:
  ◦ Data collection instruments may be faulty
  ◦ Human or computer errors may occur during data entry
  ◦ Transmission errors may occur
  ◦ Technology limitations, such as buffer size, may cause errors during data entry
Noisy Data

Examples of Noisy Data


Smoothing Noisy Data

• By smoothing noisy data we can correct the errors
• Smoothing noisy data is performed by:
  ◦ Validation and correction
  ◦ Standardization
Validation and Correction of Noisy Data

• This step examines the data for data-entry errors and tries to correct them automatically as far as possible, according to the following guidelines:
  ◦ Spell checking based on dictionary lookup is useful for identifying and correcting misspellings.
    Example: Kairo can be spell-checked and corrected to Cairo
  ◦ Using dictionaries of geographic names and zip codes helps to correct address data.
    Example: Zip code 1243456 can be detected as an error since no Zip code matches this value
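A dictionary lookup of this kind might be sketched with Python's standard `difflib` module. The lookup tables `KNOWN_CITIES` and `KNOWN_ZIP_CODES` below are hypothetical placeholders, and fuzzy matching with `get_close_matches` is just one possible way to implement spell checking; the five-digit rule follows the zip code format stated earlier in the lecture.

```python
import difflib
import re

# Hypothetical lookup tables; a real system would load full reference data.
KNOWN_CITIES = ["Cairo", "Alexandria", "Giza", "Luxor"]
KNOWN_ZIP_CODES = {"12345", "54321"}


def correct_city(name):
    """Return the closest known city name, or None if nothing is similar enough."""
    if name in KNOWN_CITIES:
        return name
    matches = difflib.get_close_matches(name, KNOWN_CITIES, n=1)
    return matches[0] if matches else None


def is_valid_zip(code):
    """A zip code must be exactly five digits and appear in the lookup table."""
    return re.fullmatch(r"\d{5}", code) is not None and code in KNOWN_ZIP_CODES


print(correct_city("Kairo"))    # Cairo
print(is_valid_zip("1243456"))  # False
print(is_valid_zip("12345"))    # True
```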
Validation and Correction of Noisy Data (Cont.)

• Check the validation rules and make sure field values follow them; for example:
  ◦ Age is a positive number and is not less than a certain amount.
    Example: if a rule governing your data says that age must be between 20 and 60, then ages of 18, 15, and 68 are detected as errors
  ◦ Each categorical value belongs to a known category.
    Example: if the only categories are A, B, C, and D, then any W or N found is declared an error
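The two rules in these examples can be expressed as simple checks. In this Python sketch the function names and default parameters are illustrative assumptions; the limits and categories come from the examples above.

```python
def invalid_ages(ages, minimum=20, maximum=60):
    """Return the ages that fall outside the allowed range."""
    return [age for age in ages if not (minimum <= age <= maximum)]


def invalid_categories(values, allowed=("A", "B", "C", "D")):
    """Return the values that do not belong to any allowed category."""
    return [v for v in values if v not in allowed]


print(invalid_ages([18, 35, 15, 68, 42]))        # [18, 15, 68]
print(invalid_categories(["A", "W", "C", "N"]))  # ['W', 'N']
```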
Validation and Correction of Noisy Data (Cont.)

• Check the fields that have ambiguous values and check for any possible data-entry errors
  Example: using the same category value to refer to different meanings. “S” in the “Marital Status” field could refer to “Single” or “Separated”
Standardization to Smooth Noisy Data

• Data values should be consistent and have a uniform format. For example:
  ◦ Date and time entries should have a specific format
    Oct. 19, 2009    10/19/2009    19/10/2009
    All dates must be written in the same agreed-upon format (e.g., Day/Month/Year)
  ◦ Names and other string data should be converted to either upper or lower case.
    MOHAMED AHMED instead of Mohamed Ahmed
  ◦ Prefixes and suffixes should be removed from names.
    Mohamed Ahmed instead of Mr. Mohamed Ahmed
    Mohamed Ahmed instead of Mohamed Ahmed, Ph.D.
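These standardization rules might be sketched in Python as below. The list of accepted input formats, the prefix and suffix patterns, and the function names are all assumptions made for illustration. The sketch also assumes English month abbreviations, and it reads an ambiguous date such as 03/04/2009 as Month/Day/Year because that format is tried first; in practice the source format should be known in advance.

```python
import re
from datetime import datetime

# Assumed input formats, tried in order.
DATE_FORMATS = ["%b. %d, %Y", "%m/%d/%Y", "%d/%m/%Y"]

# Assumed title prefixes and degree/generation suffixes to strip.
PREFIX = re.compile(r"^(mr|mrs|ms|dr|prof)\.?\s+", re.IGNORECASE)
SUFFIX = re.compile(r"[,\s]+(ph\.?d|m\.?d|jr|sr)\.?$", re.IGNORECASE)


def standardize_date(text):
    """Rewrite a date in Day/Month/Year form."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text}")


def standardize_name(text):
    """Strip prefixes and suffixes, then convert to upper case."""
    name = PREFIX.sub("", text.strip())
    name = SUFFIX.sub("", name)
    return name.upper()


print(standardize_date("Oct. 19, 2009"))         # 19/10/2009
print(standardize_date("10/19/2009"))            # 19/10/2009
print(standardize_name("Mr. Mohamed Ahmed"))     # MOHAMED AHMED
print(standardize_name("Mohamed Ahmed, Ph.D."))  # MOHAMED AHMED
```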
Standardization to Smooth Noisy Data (Cont.)

• Abbreviations and encoding schemes should be resolved consistently by consulting special dictionaries or applying predefined conversion rules.
  Example: US is the standard abbreviation of United States
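A conversion rule of this kind can be as simple as a lookup table. In the following Python sketch the table `COUNTRY_CODES` and its entries are hypothetical; unknown values are returned unchanged so that they can be reviewed by hand.

```python
# Hypothetical conversion table: every known variant maps to one standard code.
COUNTRY_CODES = {
    "united states": "US",
    "united states of america": "US",
    "usa": "US",
    "u.s.": "US",
    "us": "US",
}


def standardize_country(text):
    """Map a country name variant to its standard abbreviation if known."""
    key = " ".join(text.lower().split())  # normalize case and spacing
    return COUNTRY_CODES.get(key, text)


print(standardize_country("United  States"))  # US
print(standardize_country("U.S."))            # US
print(standardize_country("Egypt"))           # Egypt
```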


Data Inconsistency

• Data inconsistency means that different data items contain discrepancies in their values
• It can occur when data items depend on other data items and their values don’t match; for example:
  ◦ Age and Birth-date: age can be computed from the birth-date, hence the value of Age must match the value computed from the birth-date
  ◦ City and Phone-area-code: each city has a certain area code
  ◦ Total-price and (unit-price and quantity): total-price can be computed from the unit-price and quantity
• These dependencies can be used to detect errors and to substitute missing values or correct wrong values
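The three dependencies above can be checked mechanically. The Python sketch below is only an illustration: the record layout, the field names, the `AREA_CODES` table, and the function names are assumptions, and a city missing from the table is simply reported as a mismatch.

```python
import math
from datetime import date

# Illustrative city-to-area-code table (an assumption for this sketch).
AREA_CODES = {"Cairo": "02", "Alexandria": "03"}


def age_on(birth_date, today):
    """Full years elapsed between birth_date and today."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def find_inconsistencies(record, today):
    """Return the names of the fields that disagree with the fields they depend on."""
    problems = []
    if record["age"] != age_on(record["birth_date"], today):
        problems.append("age")
    if AREA_CODES.get(record["city"]) != record["area_code"]:
        problems.append("area_code")
    expected_total = record["unit_price"] * record["quantity"]
    if not math.isclose(record["total_price"], expected_total):
        problems.append("total_price")
    return problems


record = {
    "age": 30,
    "birth_date": date(1990, 5, 1),
    "city": "Cairo",
    "area_code": "03",
    "unit_price": 2.5,
    "quantity": 4,
    "total_price": 12.0,
}
print(find_inconsistencies(record, today=date(2020, 6, 1)))  # ['area_code', 'total_price']
```

The same dependencies can also be used in the other direction, for example to fill in a missing total price as unit price times quantity.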
Data Inconsistency Example

Example of Inconsistent Data

Data Inconsistency Marked in Red

Incorrect Area-Code
Total_Price doesn’t equal (Quantity*Unit_Price)
