Lecture 1

Uploaded by

Saad Mohamed Saad

CSE 412: SELECTED TOPICS IN COMPUTER ENGINEERING

Data Cleansing

Reasons for Data Cleansing

• The data to be analyzed may be:
  - Incomplete: some values are missing
  - Noisy: values may contain errors or outliers
  - Inconsistent: values may contain discrepancies
How Can Data Be Cleaned?

• Filling in missing values
• Smoothing noisy data
• Identifying and removing outliers
• Resolving inconsistency
Typical Example

[Table: Sample Table]

Typical Example (Incomplete Data)

[Table: Sample Table]
Typical Example (Data Value Errors)

[Table: Sample Table]

Reasons:
• A Zip code consists of five digits and cannot contain any letters
• Income must be a positive number
• Age must be a positive number
Typical Example (Outlier Values)

[Table: Sample Table]

Reasons:
• Outliers are data values that deviate from the expected values of the rest of the data set
• The values 10000000 and -40000 diverge sharply from the rest of the values
Typical Example (Ambiguity)

[Table: Sample Table]

Reasons:
• "S" in Marital Status could refer to "Single" or "Separated"
• The value is therefore ambiguous
Reasons for Incomplete Data

• Relevant data may not be recorded because of:
  - A misunderstanding by the data-entry staff
  - Equipment failure
• Relevant data may not be available because it is unknown or because providing it is optional
Dealing with Incomplete Data

There are several ways to deal with missing data:

• Replace the missing value with a default value
• Replace the missing value with the field mean for fields that take numerical values, or with the mode (if one exists) for fields that take categorical values
• Replace the missing value with a value generated at random from the observed distribution of the field
Mean, Median, and Mode

• The mean of a population of n values x1, x2, ..., xn can be computed by:

Mean = (x1 + x2 + ... + xn) / n

• Consider the following list of 9 numbers:

13, 15, 12, 17, 22, 11, 13, 19, 12

Mean = (13 + 15 + 12 + 17 + 22 + 11 + 13 + 19 + 12)/9 = 134/9 = 14.88889

Mean, Median, and Mode

• The median is the middle value of the ordered list of numbers.
• Consider the following list of 9 numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12
• To compute the median, first order the numbers:

11, 12, 12, 13, 13, 15, 17, 19, 22

• Hence, the median is 13 (the fifth of the nine ordered values)

Mean, Median, and Mode

• When the list has an even number of values, the median is the average of the two middle values of the ordered list.
• Consider the following list of 10 numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12, 14
• To compute the median, first order the numbers:

11, 12, 12, 13, 13, 14, 15, 17, 19, 22

• Hence, the median is (13 + 14)/2 = 13.5

Mean, Median, and Mode

• The mode of a set of data is the value in the set that occurs most often.
• Consider the following list of numbers:
13, 15, 12, 17, 22, 11, 13, 19, 13

Number   Occurrence
13       3
15       1
12       1
17       1
22       1
11       1
19       1

The mode is 13
Mean, Median, and Mode

• A data set can have more than one mode when several values share the highest occurrence.
• Consider the following list of numbers:
13, 15, 12, 17, 22, 11, 13, 19, 12

Number   Occurrence
13       2
15       1
12       2
17       1
22       1
11       1
19       1

The modes are 13 and 12 (bimodal)
Mean, Median, and Mode

• When every value occurs only once, the data set has no mode.
• Consider the following list of numbers:
13, 15, 12, 17, 22, 11, 19

Number   Occurrence
13       1
15       1
12       1
17       1
22       1
11       1
19       1

There is no mode
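The three measures can be checked against the lists above with a short Python sketch. It relies on the standard `statistics` module; the helper name `modes` and the rule that a list with no repeated value has no mode are assumptions made here to match the slides.

```python
from statistics import mean, median, multimode


def modes(values):
    """Return the most frequent value(s), or [] when no value repeats."""
    # Assumption: a list in which every value occurs exactly once has no mode.
    if len(set(values)) == len(values):
        return []
    return multimode(values)


nine = [13, 15, 12, 17, 22, 11, 13, 19, 12]
ten = nine + [14]

print(round(mean(nine), 5))   # 14.88889
print(median(nine))           # 13
print(median(ten))            # 13.5
print(modes([13, 15, 12, 17, 22, 11, 13, 19, 13]))  # [13]
print(modes(nine))            # [13, 12] (bimodal)
print(modes([13, 15, 12, 17, 22, 11, 19]))          # [] (no mode)
```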
Handling Missing Values

[Table: a set of fields with missing values]
Handling Missing Values (Using Default Values)

Defaults
• Field1 default: 0
• Field2 default: N
• Field3 default: 240
• Field4 default: 50
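A minimal Python sketch of default-value filling follows. The defaults are the ones listed above; the sample record, the use of `None` to mark a missing value, and the function name `fill_with_defaults` are assumptions for illustration.

```python
# Defaults taken from the slide.
DEFAULTS = {"Field1": 0, "Field2": "N", "Field3": 240, "Field4": 50}


def fill_with_defaults(record, defaults=DEFAULTS):
    """Return a copy of the record in which missing (None) fields take their default."""
    return {
        field: defaults.get(field) if value is None else value
        for field, value in record.items()
    }


# Hypothetical record with two missing fields.
record = {"Field1": 21, "Field2": None, "Field3": None, "Field4": 80}
print(fill_with_defaults(record))
```

Default filling is simple, but every missing entry receives the same value regardless of what the rest of the field looks like.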
Handling Missing Values (Using Means and Modes)

• Use the mean for the numeric fields and the mode (if one exists) for the categorical fields
• If no mode exists, rely on either a default value or a random value
• Numeric fields: Field1, Field3, and Field4
  - Field1 mean = (21+24+22+12+11+16+16+17+18)/9 = 157/9 = 17.44
  - Field3 mean = 334.44
  - Field4 mean = 81.78
• If a field does not accept decimal values, round the mean to the nearest whole number
Handling Missing Values (Using Means and Modes)

• Field2 is categorical, hence we need to compute the mode from the existing values

Category   Occurrence
A          3
B          1
W          2
C          2

• Hence, the mode is A

Handling Missing Values (Using Means and Modes)

Assumptions
• Field1 and Field4 do not accept decimal numbers, hence the mean is rounded (Field1: 17, Field4: 82)
• Field3 accepts decimal numbers, hence the mean is used without rounding (334.44)
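The mean and mode replacement can be sketched in Python as below. The Field1 values and the Field2 category counts come from the slides; the positions of the missing entries, the `None` marker, the function names, and the rule of falling back whenever there is not exactly one most frequent value are assumptions.

```python
from statistics import mean, multimode


def fill_numeric(values, whole_numbers=False):
    """Replace None with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    if whole_numbers:
        fill = round(fill)  # the field does not accept decimals
    return [fill if v is None else v for v in values]


def fill_categorical(values, fallback):
    """Replace None with the mode, or with the fallback when no single mode exists."""
    present = [v for v in values if v is not None]
    candidates = multimode(present)
    # Assumption: use the fallback unless exactly one value is most frequent.
    fill = candidates[0] if len(candidates) == 1 else fallback
    return [fill if v is None else v for v in values]


# Hypothetical placement of the missing entries.
field1 = [21, 24, None, 22, 12, 11, 16, 16, 17, 18]
field2 = ["A", "B", None, "W", "A", "C", "W", "C", "A"]

print(fill_numeric(field1, whole_numbers=True))  # missing entry becomes 17
print(fill_categorical(field2, fallback="N"))    # missing entry becomes "A"
```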
Handling Missing Values (Using Random Values)
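One way to generate a value "at random from the observed distribution" is to draw from the values that are already present in the field, so frequent values are picked more often. The sketch below does that; the function name `fill_from_observed`, the `None` marker, and the sample field are assumptions, and other sampling schemes are equally valid.

```python
import random


def fill_from_observed(values, rng=None):
    """Replace None with a value drawn at random from the observed values."""
    rng = rng or random.Random()
    present = [v for v in values if v is not None]
    # Drawing uniformly from the present values follows their observed frequencies.
    return [rng.choice(present) if v is None else v for v in values]


# Hypothetical categorical field with two missing entries.
field = ["A", None, "B", "A", None, "C"]
print(fill_from_observed(field, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the result reproducible, which helps when a cleansing run has to be repeated.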
Handling Outliers

[Table: Sample Table, showing a data set and its possible outlier values]

• Outliers are data values that deviate from the expected values of the rest of the data set
• Outliers are extreme values that lie near the limits of the data range or go against the trend of the remaining data
• Outliers normally need further investigation to make sure that they are not the result of data-entry mistakes
Handling Outliers Using Inter-quartile Range

[Figure: sorted data items marked with Q1 and Q3; labels: 25% of data items, 50% of data items, 75% of data items]

• A quartile is any of the three values that divide the sorted data set into four equal parts
• The first quartile (Q1) cuts off the lowest 25% of the data
• The second quartile (Q2) cuts the data set in half (it is the median of the data set)
• The third quartile (Q3) cuts off the highest 25% of the data, or the lowest 75%
Computing Q1, Q2, and Q3

• To compute Q1, Q2, and Q3, use the following method:
  - Sort the data set in ascending order.
  - Use the median to divide the ordered data set into two halves. This median is the second quartile (Q2). If the median is one of the data items, exclude it from any further computation.
  - The first quartile (Q1) is the median of the lower half of the data.
  - The third quartile (Q3) is the median of the upper half of the data.
Example #1 of Computing Q1, Q2, and Q3

• Compute Q1, Q2, and Q3 for the following data set:

6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36

• Sort the data set in ascending order:

6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49

• Q2 = 40 (median of the data set)
• Q1 is the median of the lower half of the data (6, 7, 15, 36, 39).
  - Q1 = 15
• Q3 is the median of the upper half of the data (41, 42, 43, 47, 49).
  - Q3 = 43
Example #2 of Computing Q1, Q2, and Q3

• Compute Q1, Q2, and Q3 for the following data set:
39, 36, 7, 40, 41, 17
• Sort the data set in ascending order:
7, 17, 36, 39, 40, 41
• Q2 = (36+39)/2 = 37.5 (median of the data set). The number of data items is even, so the median is the average of the two middle items.
• The median is not a data item, hence all items in the first half are used to compute Q1 and all remaining items are used to compute Q3.
• Q1 is the median of the lower half of the data (7, 17, 36).
  - Q1 = 17
• Q3 is the median of the upper half of the data (39, 40, 41).
  - Q3 = 40
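The median-of-halves method used in the two examples can be sketched in Python as follows. The function name `quartiles` is an assumption. Library routines such as `statistics.quantiles` use interpolation rules and may return different values for other data sets.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method from the slides."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # With an odd count the median is a data item and is excluded from both halves.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]))  # (15, 40, 43)
print(quartiles([39, 36, 7, 40, 41, 17]))                     # (17, 37.5, 40)
```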
Detecting Outliers using Inter-quartile Range

• Compute the Inter-Quartile Range (IQR) as follows:

IQR = Q3 - Q1

• A data value is an outlier if:
  - its value is <= (Q1 - 1.5*IQR), or
  - its value is >= (Q3 + 1.5*IQR).
Example of Detecting Outliers using Inter-quartile Range

[Table: Sample Table]

• Data set that might contain outliers:
75000, -40000, 10000000, 50000, 99999
• Ordered data set:
-40000, 50000, 75000, 99999, 10000000
• Q2 = 75000
• Q1 = (-40000 + 50000)/2 = 5000
• Q3 = (99999 + 10000000)/2 = 5049999.5
• IQR = Q3 - Q1
= 5049999.5 - 5000
= 5044999.5
• Q1 - 1.5*IQR = 5000 - 1.5*5044999.5 = -7562499.25
• Q3 + 1.5*IQR = 5049999.5 + 1.5*5044999.5 = 12617498.75
• All values in the data set lie within this range, hence there are no outliers in this example. With only five items, the two extreme values themselves determine Q1 and Q3, which widens the range.
Example of Detecting Outliers using Inter-quartile Range

• Data set that might contain outliers:
75000, 40000, 10000000, 50000, 99999, 75000
• Ordered data set:
40000, 50000, 75000, 75000, 99999, 10000000
• Q2 = (75000 + 75000)/2 = 75000
• Q1 = 50000
• Q3 = 99999
• IQR = Q3 - Q1
= 99999 - 50000
= 49999
• Q1 - 1.5*IQR = 50000 - 1.5*49999 = -24998.5
• Q3 + 1.5*IQR = 99999 + 1.5*49999 = 174997.5
• Hence the data item 10000000 is an outlier and should be re-investigated for any data-entry errors
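Both examples can be reproduced with a small Python sketch of the IQR rule. It uses the median-of-halves quartiles and the <= / >= comparisons stated in the slides; the function name `iqr_outliers` is an assumption.

```python
from statistics import median


def iqr_outliers(values):
    """Return (outliers, (lower_limit, upper_limit)) using the 1.5*IQR rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    # With an odd count the median is excluded from the upper half.
    q3 = median(ordered[half + 1:] if len(ordered) % 2 else ordered[half:])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v <= low or v >= high]
    return outliers, (low, high)


print(iqr_outliers([75000, -40000, 10000000, 50000, 99999]))
# ([], (-7562499.25, 12617498.75))
print(iqr_outliers([75000, 40000, 10000000, 50000, 99999, 75000]))
# ([10000000], (-24998.5, 174997.5))
```

A flagged value is only a candidate for investigation; the rule does not by itself say that the value is wrong.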
Noisy Data

• Noisy data is data that contains incorrect values
• Some reasons for noisy data:
  - Data collection instruments may be faulty
  - Human or computer errors may occur during data entry
  - Transmission errors may occur
  - Technology limitations, such as buffer size, may affect data entry
Noisy Data

[Table: examples of noisy data]
Smoothing Noisy Data

• Smoothing noisy data lets us correct the errors
• Smoothing noisy data is performed by:
  - Validation and correction
  - Standardization
Validation and Correction of Noisy Data

• This step examines the data for data-entry errors and tries to correct them automatically as far as possible, according to the following guidelines:
  - Spell checking based on dictionary lookup is useful for identifying and correcting misspellings.
    Example: Kairo can be spell-checked and corrected to Cairo
  - Dictionaries of geographic names and zip codes help to correct address data.
    Example: Zip code 1243456 can be detected as an error since no Zip code matches this value
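Dictionary lookup can be sketched in Python with the standard `difflib` module, which picks the closest known spelling. The city list, the set of known Zip codes, the similarity cutoff of 0.7, and the function names are all hypothetical values chosen for illustration.

```python
from difflib import get_close_matches

KNOWN_CITIES = ["Cairo", "Alexandria", "Giza"]  # hypothetical dictionary
KNOWN_ZIP_CODES = {"12345", "12346", "12434"}   # hypothetical dictionary


def correct_city(name, known=KNOWN_CITIES, cutoff=0.7):
    """Return the known city closest to name, or None if nothing is close enough."""
    if name in known:
        return name
    matches = get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def is_known_zip(code, known=KNOWN_ZIP_CODES):
    """A Zip code is accepted only if it appears in the dictionary."""
    return code in known


print(correct_city("Kairo"))    # Cairo
print(is_known_zip("1243456"))  # False
```

Returning `None` instead of guessing keeps values with no close match available for manual review.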
Validation and Correction of Noisy Data (Cont.)

• Check the validation rules and make sure that field values follow them; for example:
  - Age is a positive number and is not less than a certain minimum.
    Example: if a rule governing your data says that age must be between 20 and 60, then ages of 18, 15, and 68 are detected as errors
  - Each categorical value belongs to one of the allowed categories.
    Example: if the only categories are A, B, C, and D, then values of W or N are reported as errors
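The two example rules can be expressed as a short Python sketch. The limits 20 to 60 and the categories A to D come from the examples above; the record layout, the field names `age` and `category`, and the function name `validate` are assumptions.

```python
def validate(record, min_age=20, max_age=60, categories=frozenset("ABCD")):
    """Return a list of rule violations found in the record (empty if none)."""
    errors = []
    age = record.get("age")
    if not isinstance(age, (int, float)) or not (min_age <= age <= max_age):
        errors.append(f"age {age!r} is not between {min_age} and {max_age}")
    category = record.get("category")
    if category not in categories:
        errors.append(f"category {category!r} is not one of {sorted(categories)}")
    return errors


# Hypothetical records.
for record in [{"age": 35, "category": "B"},
               {"age": 18, "category": "A"},
               {"age": 68, "category": "W"}]:
    print(record, validate(record))
```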
Validation and Correction of Noisy Data (Cont.)

• Check the fields that have ambiguous values and check for any possible data-entry errors
  Example: the same category value is used with different meanings. "S" in the "Marital Status" field could refer to "Single" or "Separated"
Standardization to Smooth Noisy Data

• Data values should be consistent and have a uniform format. For example:
  - Date and time entries should have a specific format.
    The same date may appear as Oct. 19, 2009, 10/19/2009, or 19/10/2009.
    All dates must be written in the one format that has been agreed upon (e.g., Day/Month/Year)
  - Names and other string data should be converted to either upper or lower case.
    MOHAMED AHMED instead of Mohamed Ahmed
  - Prefixes and suffixes should be removed from names.
    Mohamed Ahmed instead of Mr. Mohamed Ahmed
    Mohamed Ahmed instead of Mohamed Ahmed, Ph.D.
Standardization to Smooth Noisy Data (Cont.)

• Abbreviations and encoding schemes should be resolved consistently by consulting special dictionaries or applying predefined conversion rules.
  Example: US is the standard abbreviation of United States
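The standardization rules can be sketched in Python as below. The target date format Day/Month/Year and the upper-case convention follow the examples above; the lists of prefixes, suffixes and abbreviation variants, and all function names, are assumptions. Each date is passed together with its known source format, because a string such as 10/11/2009 cannot be interpreted reliably on its own.

```python
import re
from datetime import datetime

TITLE_PREFIXES = ("Mr.", "Mrs.", "Ms.", "Dr.")                    # assumed list
SUFFIX_PATTERN = re.compile(r",\s*(Ph\.D\.|M\.D\.|Jr\.|Sr\.)$")   # assumed list
ABBREVIATIONS = {"United States": "US", "U.S.": "US", "USA": "US"}  # assumed variants


def standardize_date(text, source_format):
    """Convert a date written in source_format to Day/Month/Year."""
    return datetime.strptime(text, source_format).strftime("%d/%m/%Y")


def standardize_name(name):
    """Strip a leading title and a trailing suffix, then convert to upper case."""
    name = name.strip()
    for prefix in TITLE_PREFIXES:
        if name.startswith(prefix + " "):
            name = name[len(prefix):].lstrip()
            break
    return SUFFIX_PATTERN.sub("", name).upper()


def standardize_country(value):
    """Map known variants to the standard abbreviation; leave other values as-is."""
    value = value.strip()
    return ABBREVIATIONS.get(value, value)


print(standardize_date("Oct. 19, 2009", "%b. %d, %Y"))  # 19/10/2009
print(standardize_date("10/19/2009", "%m/%d/%Y"))       # 19/10/2009
print(standardize_name("Mr. Mohamed Ahmed"))            # MOHAMED AHMED
print(standardize_name("Mohamed Ahmed, Ph.D."))         # MOHAMED AHMED
print(standardize_country("United States"))             # US
```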

Data Inconsistency

• Data inconsistency means that different data items contain discrepancies in their values
• It can occur when data items depend on other data items and their values do not match; for example:
  - Age and Birth-date: age can be computed from the birth-date, hence the value of Age must match the value computed from the birth-date
  - City and Phone-area-code: each city has a certain area code
  - Total-price and (unit-price and quantity): total-price can be computed from the unit-price and quantity
• These dependencies can be used to detect errors, substitute missing values, or correct wrong values
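The three dependencies can be checked with a Python sketch like the one below. The record layout, the field names, the city-to-area-code table, and the function names are illustrative assumptions. The reference date is passed in explicitly so that the age check is repeatable.

```python
from datetime import date
from math import isclose


def age_on(birth_date, today):
    """Age in whole years on the given day."""
    years = today.year - birth_date.year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1  # birthday has not happened yet this year
    return years


def find_inconsistencies(record, today, area_codes):
    """Return a list of the dependencies that the record violates."""
    problems = []
    if record["age"] != age_on(record["birth_date"], today):
        problems.append("age does not match birth_date")
    if area_codes.get(record["city"]) != record["area_code"]:
        problems.append("area_code does not match city")
    expected_total = record["quantity"] * record["unit_price"]
    if not isclose(record["total_price"], expected_total, abs_tol=1e-9):
        problems.append("total_price does not equal quantity * unit_price")
    return problems


AREA_CODES = {"Cairo": "02", "Alexandria": "03"}  # illustrative lookup table
record = {  # hypothetical record with two inconsistencies
    "age": 30, "birth_date": date(1990, 5, 1),
    "city": "Cairo", "area_code": "03",
    "quantity": 4, "unit_price": 25.0, "total_price": 90.0,
}
print(find_inconsistencies(record, date(2020, 6, 1), AREA_CODES))
```

The same dependencies can also repair data: a missing total price can be recomputed from quantity and unit price, and a missing age from the birth date.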
Data Inconsistency Example

[Table: example of inconsistent data, with the inconsistencies marked in red]

• Incorrect area code
• Total_Price does not equal (Quantity*Unit_Price)
