
Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU

The document discusses data preprocessing for data mining. It explains that preprocessing is an important step that involves cleaning data by handling noise, missing values, inconsistencies etc. The major tasks in preprocessing are data cleaning, integration, transformation, reduction and discretization. Descriptive analysis of data through measures of central tendency and dispersion are also covered. Preprocessing makes data suitable for mining to obtain quality results.

Uploaded by

Dipty Sarker

DATA MINING

CSE-443

Ayesha Aziz Prova


Lecturer,
Dept. of CSE
CWU
DATA PREPROCESSING

• Why preprocess the data?
• Descriptive data summarization
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

THE DATA ANALYSIS PIPELINE

• Mining is not the only step in the analysis process

[Figure: Data Preprocessing → Data Mining → Result Post-processing]

• Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning is required to make sense of the data
  - Techniques: sampling, dimensionality reduction, feature selection
  - It is dirty work, but it is often the most important step of the analysis
• Post-processing: make the data actionable and useful to the user
  - Statistical analysis of importance
  - Visualization
• Pre- and post-processing are often data mining tasks as well

DATA QUALITY

• Examples of data quality problems:
  - Noise and outliers
  - Missing values
  - Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat  Note
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes    A mistake or a millionaire?
6    No      NULL            60K             No     Missing value
7    Yes     Divorced        220K            NULL   Missing value
8    No      Single          85K             Yes
9    No      Married         90K             No     Inconsistent duplicate entries
9    No      Single          90K             No     Inconsistent duplicate entries

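The three problems flagged in the table can be detected with a few lines of Python. This is only a sketch: the rows are copied from the table, `None` stands in for NULL, and the 1000K income threshold is a hypothetical cut-off chosen for illustration, not a rule from the lecture.

```python
# Rows copied from the slide's table; None stands for the NULL entries.
# Columns: Tid, Refund, Marital Status, Taxable Income (in K), Cheat
records = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 10000, "Yes"),
    (6, "No", None, 60, "No"),
    (7, "Yes", "Divorced", 220, None),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 90, "No"),
    (9, "No", "Single", 90, "No"),
]

# Missing values: any record with at least one None field.
missing_tids = [r[0] for r in records if any(v is None for v in r)]

# Duplicate entries: a Tid that appears more than once.
seen, duplicate_tids = set(), set()
for tid, *_ in records:
    if tid in seen:
        duplicate_tids.add(tid)
    seen.add(tid)

# Possible noise: income above a hypothetical 1000K threshold (assumption).
suspicious_tids = [r[0] for r in records if r[3] > 1000]

print(missing_tids)            # [6, 7]
print(sorted(duplicate_tids))  # [9]
print(suspicious_tids)         # [5]
```

Whether Tid 5 is really an error cannot be decided from the data alone, which is the point of the "mistake or millionaire" question.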
WHY DATA PREPROCESSING?

• Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=“ ”
  - noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42” Birthday=“03/07/1997”
    - e.g., Was rating “1,2,3”, now rating “A, B, C”
    - e.g., discrepancy between duplicate records

WHY IS DATA DIRTY?

• Incomplete data may come from
  - “Not applicable” data value when collected
  - Different considerations between the time when the data was collected and when it is analyzed
  - Human/hardware/software problems
• Noisy data (incorrect values) may come from
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
• Inconsistent data may come from
  - Different data sources
  - Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning

WHY IS DATA PREPROCESSING IMPORTANT?

• No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse

MULTI-DIMENSIONAL MEASURE OF DATA QUALITY

• A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
• Broad categories:
  - Intrinsic, contextual, representational, and accessibility

MAJOR TASKS IN DATA PREPROCESSING

• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data transformation
  - Normalization and aggregation
• Data reduction
  - Obtains a reduced representation in volume but produces the same or similar analytical results
• Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

FORMS OF DATA PREPROCESSING

DATA PREPROCESSING

• Why preprocess the data?
• Descriptive data summarization
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

MINING DATA DESCRIPTIVE CHARACTERISTICS

• Motivation
  - To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  - median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube

CENTRAL TENDENCY

• A measure of central tendency is a value at the center or middle of a data set.
• Mean, median, mode

TERMINOLOGY

• Population
  - A collection of items of interest in research
  - A complete set of things
  - A group that you wish to generalize your research to
  - An example: all the trees in Battle Park
• Sample
  - A subset of a population
  - Its size is smaller than the size of the population
  - An example: 100 trees randomly selected from Battle Park

SAMPLE VS. POPULATION

[Figure: Population and Sample]

MEASURES OF CENTRAL TENDENCY – MEAN

• Mean: the most commonly used measure of central tendency
• Average of all observations
• The sum of all the scores divided by the number of scores
• Note: this assumes that each observation is equally significant

MEASURES OF CENTRAL TENDENCY – MEAN

Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$

where n is the sample size and N is the population size.

MEASURES OF CENTRAL TENDENCY – MEAN

• Example I
  - Data: 8, 4, 2, 6, 10
  - Mean: (8 + 4 + 2 + 6 + 10) / 5 = 30 / 5 = 6
• Example II
  - Sample: 10 trees randomly selected from Battle Park
  - Diameter (inches): 9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5
  - Mean: 143.8 / 10 = 14.38

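Both examples can be checked with a short Python sketch. The helper name `mean` is ours, and the data are the two samples given in the examples.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)

data = [8, 4, 2, 6, 10]
diameters = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]

print(mean(data))                 # 6.0
print(round(mean(diameters), 2))  # 14.38
```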
WEIGHTED MEAN

• We can also calculate a weighted mean using some weighting factor:
  - e.g. What is the average income of all people in cities A, B, and C?

City  Avg. Income  Population
A     $23,000      100,000
B     $20,000      50,000
C     $25,000      150,000

Here, population is the weighting factor and the average income is the variable of interest.

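A minimal sketch of the calculation, assuming the usual definition of the weighted mean (the sum of weight × value divided by the sum of the weights). The function name `weighted_mean` is ours.

```python
def weighted_mean(pairs):
    """pairs: list of (value, weight). Returns sum(w * x) / sum(w)."""
    total_weight = sum(w for _, w in pairs)
    return sum(x * w for x, w in pairs) / total_weight

# (average income, population) for cities A, B and C, taken from the table
cities = [(23_000, 100_000), (20_000, 50_000), (25_000, 150_000)]

weighted = weighted_mean(cities)
unweighted = sum(x for x, _ in cities) / len(cities)

print(weighted)              # 23500.0
print(round(unweighted, 2))  # 22666.67
```

The unweighted average of the three city figures is lower because it gives the small city B the same influence as the large city C.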
Measures of Central Tendency – Median

• Median: the value of a variable such that half of the observations are above it and half are below it, i.e. it divides the distribution into two groups of equal size
• When the number of observations is odd, the median is simply the middle value
• When the number of observations is even, we take the median to be the average of the two values in the middle of the distribution

MEASURES OF CENTRAL TENDENCY – MEDIAN

• Example I
  - Data: 8, 4, 2, 6, 10 (mean: 6)
  - Sorted: 2, 4, 6, 8, 10
  - Median: 6
• Example II
  - Sample: 10 trees randomly selected from Battle Park
  - Diameter (inches): 9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5 (mean: 14.38)
  - Sorted: 7.8, 9.8, 10.1, 10.2, 13.9, 14.5, 15.5, 17.5, 20.0, 24.5
  - Median: (13.9 + 14.5) / 2 = 14.2

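The odd/even rule for the median translates directly into code. The sketch below uses a hand-written `median` helper (our name) so that both branches are visible.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                 # odd n: the single middle value
    return (s[mid - 1] + s[mid]) / 2  # even n: average of the two middle values

data = [8, 4, 2, 6, 10]
diameters = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]

print(median(data))                 # 6
print(round(median(diameters), 2))  # 14.2
```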
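Measures of Central Tendency – Median

• For a continuous frequency distribution, the median is calculated with the following formula:

  $\text{Median} = L + \frac{N/2 - cf}{f} \times h$

  where L is the lower limit of the median class, N is the total frequency, cf is the cumulative frequency of the class before the median class, f is the frequency of the median class, and h is the class width.
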
Measures of Central Tendency – Median

Age Group  Frequency (f)  Cumulative frequency (cf)
0-20       15             15
20-40      32             47
40-60      54             101
60-80      30             131
80-100     19             150
Total      150

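Measures of Central Tendency – Median

For the age-group table (frequencies 15, 32, 54, 30, 19; cumulative frequencies 15, 47, 101, 131, 150):

• N = 150, so N/2 = 75
• The median class is 40-60, the first class whose cumulative frequency (101) reaches 75
• L = 40, cf = 47 (cumulative frequency before the median class), f = 54, h = 20
• Median = L + ((N/2 - cf) / f) × h = 40 + ((75 - 47) / 54) × 20 ≈ 50.37
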
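The grouped-median calculation for the age-group table can be sketched as follows. It assumes the standard interpolation formula Median = L + ((N/2 - cf) / f) × h, where cf is the cumulative frequency before the median class. The function name `grouped_median` is ours.

```python
def grouped_median(classes):
    """classes: list of (lower, upper, frequency) in ascending order."""
    n = sum(f for _, _, f in classes)
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        if cumulative + f >= n / 2:  # first class that reaches N/2: median class
            return lower + (n / 2 - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("no classes given")

age_groups = [(0, 20, 15), (20, 40, 32), (40, 60, 54), (60, 80, 30), (80, 100, 19)]

print(round(grouped_median(age_groups), 2))  # 50.37
```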
Measures of Central Tendency – Mode

• Mode: the most frequent value or score in the distribution.
• It is defined as the value of the item in a series that occurs most often.
• Example I
  80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162
  Mode: 109 (it occurs three times)

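Finding the mode is a counting exercise. The sketch below uses `collections.Counter` from the standard library and returns every value that ties for the highest count, since a data set can have more than one mode. The function name `modes` is ours.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

scores = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
          110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

print(modes(scores))  # [109]
```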
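Measures of Central Tendency – Mode

• For grouped data, the exact value of the mode can be obtained by the following formula:

  $\text{Mode} = L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h$

  where L is the lower limit of the modal class, f_1 is the frequency of the modal class, f_0 and f_2 are the frequencies of the classes before and after it, and h is the class width.
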
Measures of Central Tendency – Mode

Monthly rent (Rs)  Number of Libraries (f)
500-1000           5
1000-1500          10
1500-2000          8
2000-2500          16
2500-3000          14
3000 & above       12
Total              65

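Measures of Central Tendency – Mode

For the monthly-rent table (frequencies 5, 10, 8, 16, 14, 12):

• The modal class is 2000-2500, the class with the highest frequency (16)
• L = 2000, f1 = 16, f0 = 8 (class before), f2 = 14 (class after), h = 500
• Mode = L + ((f1 - f0) / (2 f1 - f0 - f2)) × h = 2000 + (8 / 10) × 500 = 2400
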
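A sketch of the grouped-mode calculation for the monthly-rent table, assuming the standard formula Mode = L + ((f1 - f0) / (2 f1 - f0 - f2)) × h. The function name `grouped_mode` is ours, and the open-ended class "3000 & above" is stored with `None` as its upper limit.

```python
def grouped_mode(classes):
    """classes: list of (lower, upper, frequency) in ascending order."""
    # Index of the modal class (highest frequency).
    i = max(range(len(classes)), key=lambda k: classes[k][2])
    lower, upper, f1 = classes[i]
    f0 = classes[i - 1][2] if i > 0 else 0                 # class before
    f2 = classes[i + 1][2] if i + 1 < len(classes) else 0  # class after
    h = upper - lower  # fails if the modal class is the open-ended one
    return lower + (f1 - f0) * h / (2 * f1 - f0 - f2)

rents = [(500, 1000, 5), (1000, 1500, 10), (1500, 2000, 8),
         (2000, 2500, 16), (2500, 3000, 14), (3000, None, 12)]

print(grouped_mode(rents))  # 2400.0
```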
Measures of Central Tendency – Mode

• Value that occurs most frequently in the data
• Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$

SYMMETRIC VS. SKEWED DATA

• Median, mean and mode of symmetric, positively and negatively skewed data

DATA SKEWED RIGHT

• Here we see that the data is skewed to the right and the position of the mean is to the right of the median.
  - One may surmise that a few large values spread the data out at the high end, thereby pulling the mean upward.

DATA SKEWED LEFT

• Here we see that the data is skewed to the left and the position of the mean is to the left of the median.
  - One may surmise that a few small values spread the data out at the low end, thereby pulling the mean downward.

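The effect of skew on the mean and the median can be seen on two small data sets. Both lists below are hypothetical values made up for illustration; they do not come from the lecture.

```python
from statistics import mean, median

# Hypothetical data: one large value stretches the high end (skewed right).
right_skewed = [2, 3, 3, 4, 4, 4, 5, 5, 30]
# Hypothetical data: one small value stretches the low end (skewed left).
left_skewed = [1, 26, 26, 27, 27, 27, 28, 28, 29]

print(round(mean(right_skewed), 2), median(right_skewed))  # 6.67 4
print(round(mean(left_skewed), 2), median(left_skewed))    # 24.33 27
```

In both cases the median stays with the bulk of the data while the mean moves toward the extreme value.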
MEASURING THE DISPERSION OF DATA

• Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, M, Q3, max
  - Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
  - Variance (algebraic, scalable computation):
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
    $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
  - Standard deviation s (or σ) is the square root of the variance s² (or σ²)

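A short sketch of the variance and standard deviation, assuming the usual definitions: divide the sum of squared deviations by n - 1 for a sample and by N for a population. The data set 8, 4, 2, 6, 10 is reused from the mean example. The last computation uses the algebraic shortcut, which needs only the running sums of x and x², so it can be computed in one pass over the data.

```python
import math

data = [8, 4, 2, 6, 10]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in data)

sample_variance = ss / (n - 1)  # s^2
population_variance = ss / n    # sigma^2, treating the data as a population

# Algebraic shortcut for s^2: only sum(x) and sum(x^2) are needed.
shortcut = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(sample_variance, population_variance, shortcut)  # 10.0 8.0 10.0
print(round(math.sqrt(sample_variance), 4))            # 3.1623
print(round(math.sqrt(population_variance), 4))        # 2.8284
```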
SUMMARY MEASURES

Describing Data Numerically
• Central tendency: arithmetic mean, median, mode, geometric mean
• Quartiles
• Variation: range, interquartile range, variance, standard deviation, coefficient of variation
• Shape: skewness

QUARTILES

• Quartiles split the ranked data into 4 segments with an equal number of values per segment

    25%   |   25%   |   25%   |   25%
          Q1        Q2        Q3

  - The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
  - Q2 is the same as the median (50% are smaller, 50% are larger)
  - Only 25% of the observations are greater than the third quartile, Q3

QUARTILES

• Find a quartile by determining the value in the appropriate position in the ranked data, where n is the number of observed values:
  - First quartile position: Q1 at (n+1)/4
  - Second quartile position: Q2 at (n+1)/2 (median)
  - Third quartile position: Q3 at 3(n+1)/4

INTERQUARTILE RANGE

• Can eliminate some outlier problems by using the interquartile range
• Eliminate some high- and low-valued observations and calculate the range from the remaining values

Interquartile range = 3rd quartile - 1st quartile = Q3 - Q1

Quartiles

Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10

Order the data:
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

Lower quartile (Q1) = 5½
Median (Q2) = 8
Upper quartile (Q3) = 9

Inter-Quartile Range = 9 - 5½ = 3½

Quartiles

Example 2: Find the median and quartiles for the data below.
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10

Order the data:
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

Lower quartile (Q1) = 4
Median (Q2) = 8
Upper quartile (Q3) = 10

Inter-Quartile Range = 10 - 4 = 6

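Both quartile examples can be reproduced with a short sketch. It assumes the "median of each half" method: Q1 is the median of the lower half of the ordered data and Q3 the median of the upper half, with the overall median left out of both halves when n is odd. This choice matches the worked answers above. Other conventions, including interpolating at position (n+1)/4, can give slightly different values when that position is not a whole number. The helper names are ours.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method (assumption)."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    # For odd n the middle value is excluded from the upper half as well.
    upper = s[half + 1:] if len(s) % 2 == 1 else s[half:]
    return median(lower), median(s), median(upper)

example_1 = [12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]
example_2 = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]

for data in (example_1, example_2):
    q1, q2, q3 = quartiles(data)
    print(q1, q2, q3, "IQR =", q3 - q1)
# 5.5 8.0 9.0 IQR = 3.5
# 4 8 10 IQR = 6
```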
RANGE

• Simplest measure of variation
• Difference between the largest and the smallest observations:

  Range = X_largest - X_smallest

• Disadvantages: ignores the distribution of the data and is sensitive to outliers

BOXPLOT ANALYSIS

• Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
• Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extend to Minimum and Maximum

Drawing a Box Plot

Example 1: Draw a box plot for the data below.
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

Lower quartile (Q1) = 5½
Median (Q2) = 8
Upper quartile (Q3) = 9

[Figure: box plot on an axis from 4 to 12]

Drawing a Box Plot

Example 2: Draw a box plot for the data below.
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

Lower quartile (Q1) = 4
Median (Q2) = 8
Upper quartile (Q3) = 10

[Figure: box plot on an axis from 3 to 15]

Drawing a Box Plot

Question: Stuart recorded the heights in cm of boys in his class as shown below. Draw a box plot for this data.
137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186

Lower quartile (QL) = 158
Median (Q2) = 171
Upper quartile (QU) = 180

[Figure: box plot on an axis from 130 to 190 cm]

Drawing a Box Plot

Question: Gemma recorded the heights in cm of girls in the same class and constructed a box plot from the data. The box plots for both boys and girls are shown below. Use the box plots to choose some correct statements comparing heights of boys and girls in the class. Justify your answers.

[Figure: box plots for boys and girls on a shared axis from 130 to 190 cm]

1. The boys are taller on average.
2. The smallest person is a girl.
3. The tallest person is a boy.

Outliers

• Outliers: sometimes there are extreme values that are separated from the rest of the data. These extreme values are called outliers. Outliers affect the mean.
• The 1.5 × IQR Rule for Outliers
  - Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile:
    - X < Q1 - 1.5 × IQR
    - X > Q3 + 1.5 × IQR

Outliers

• In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.
  - For these data, 1.5 × IQR = 1.5(27.5) = 41.25
    - Q1 - 1.5 × IQR = 15 - 41.25 = -26.25
    - Q3 + 1.5 × IQR = 42.5 + 41.25 = 83.75
  - Any travel time longer than 83.75 minutes is considered an outlier. A travel time cannot be shorter than -26.25 minutes, so these data have no low outliers.

Boxplots and outliers

• Consider our NY travel times data. Construct a boxplot.

  Data: 10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
  Sorted: 5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85

  Min = 5, Q1 = 15, M = 22.5, Q3 = 42.5, Max = 85

  Max = 85 is an outlier by the 1.5 × IQR rule.

[Figure: boxplot of travel time on an axis from 0 to 90]

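The five-number summary and the 1.5 × IQR fences for the NY travel times can be checked with the sketch below. As an assumption, the quartiles are taken as the medians of the lower and upper halves of the sorted data, which reproduces Q1 = 15 and Q3 = 42.5. The helper names are ours.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using median-of-halves quartiles."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if len(s) % 2 == 1 else s[half:]
    return s[0], median(lower), median(s), median(upper), s[-1]

travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

minimum, q1, m, q3, maximum = five_number_summary(travel_times)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = [x for x in travel_times if x < low_fence or x > high_fence]
# When outliers are plotted individually, the upper whisker ends at the
# largest observation that is not an outlier.
upper_whisker = max(x for x in travel_times if x <= high_fence)

print(minimum, q1, m, q3, maximum)  # 5 15.0 22.5 42.5 85
print(low_fence, high_fence)        # -26.25 83.75
print(outliers, upper_whisker)      # [85] 65
```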
THANKS

Any questions?
