Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU
THE DATA ANALYSIS PIPELINE
Mining is not the only step in the analysis process
Preprocessing: real data is noisy, incomplete, and inconsistent. Data cleaning is required to make sense of the data.
Techniques: sampling, dimensionality reduction, feature selection.
Dirty work, but often the most important step of the analysis.
Pre- and post-processing are often data mining tasks as well.
DATA QUALITY
WHY IS DATA DIRTY?
MULTI-DIMENSIONAL MEASURE OF DATA QUALITY
DATA PREPROCESSING
MINING DATA DESCRIPTIVE CHARACTERISTICS
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
CENTRAL TENDENCY
TERMINOLOGY
Population
A collection of items of interest in research
A complete set of things
A group that you wish to generalize your research to
An example – All the trees in Battle Park
Sample
A subset of a population
Its size is smaller than the size of the population
An example – 100 trees randomly selected from Battle Park
SAMPLE VS. POPULATION
[Figure: a population and a sample drawn from it]
MEASURES OF CENTRAL TENDENCY – MEAN
• Example I
– Data: 8, 4, 2, 6, 10
– Mean: (8 + 4 + 2 + 6 + 10) / 5 = 6
• Example II
– Sample: 10 trees randomly selected from Battle Park
– Diameter (inches):
9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5
– Mean: 143.8 / 10 = 14.38
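As a quick check of the two examples, the arithmetic mean (sum of the values divided by their count) can be computed with Python's standard library. This is only a small sketch; the variable names are illustrative and not part of the lecture.

```python
from statistics import mean

# Example I data from the slide
data = [8, 4, 2, 6, 10]

# Example II: diameters (inches) of the 10 sampled trees
diameters = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]

# mean = sum of the values / number of values
mean_data = mean(data)
mean_diameters = mean(diameters)

print(mean_data)
print(round(mean_diameters, 2))
```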
WEIGHTED MEAN
• We can also calculate a weighted mean using some weighting factor,
e.g. what is the average income of all people in cities A, B, and C?
City   Avg. Income   Population
A      $23,000       100,000
B      $20,000       50,000
C      $25,000       150,000
Here, population is the weighting factor and the average income is the variable of interest.
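The city example can be worked through in a few lines. The sketch below assumes the usual definition of the weighted mean, sum(w * x) / sum(w); the function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Cities A, B, C from the table
avg_income = [23_000, 20_000, 25_000]
population = [100_000, 50_000, 150_000]

# Population-weighted average income across all three cities
overall = weighted_mean(avg_income, population)

# Unweighted average of the three city averages, for comparison
simple = sum(avg_income) / len(avg_income)

print(overall, round(simple, 2))
```

The weighted result is pulled toward city C because it has the largest population, which the unweighted average of the three city figures ignores.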
Measures of Central Tendency – Median
Median – The value of a variable such that half of the observations are above it and half are below it, i.e. it divides the distribution into two groups of equal size.
When the number of observations is odd, the median is simply the middle value of the sorted data. When it is even, the median is the average of the two middle values.
MEASURES OF CENTRAL TENDENCY – MEDIAN
• Example I
– Data: 8, 4, 2, 6, 10 (mean: 6)
– Sorted: 2, 4, 6, 8, 10
– Median: 6
• Example II
– Sample: 10 trees randomly selected from Battle Park
– Diameter (inches):
9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5 (mean: 14.38)
– Sorted: 7.8, 9.8, 10.1, 10.2, 13.9, 14.5, 15.5, 17.5, 20.0, 24.5
– Median: (13.9 + 14.5) / 2 = 14.2
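The sort-then-pick-the-middle procedure used in both examples can be sketched as follows. The function name `median_of` is illustrative; Python's `statistics.median` gives the same result.

```python
def median_of(values):
    """Median by sorting: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values


median_data = median_of([8, 4, 2, 6, 10])
median_diameters = median_of(
    [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]
)

print(median_data, round(median_diameters, 2))
```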
Measures of Central Tendency – Mode
Monthly rent (Rs)   Number of Libraries (f)
500-1000            5
1000-1500           10
1500-2000           8
2000-2500           16
2500-3000           14
3000 & above        12
Total               65
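The formula used on the slide for this table did not survive extraction. The sketch below therefore assumes the common textbook interpolation for the mode of grouped data, mode = L + (f1 - f0) / (2*f1 - f0 - f2) * h, where L is the lower bound of the modal class, h its width, f1 its frequency, and f0, f2 the frequencies of the classes before and after it. The function name `grouped_mode` is illustrative.

```python
def grouped_mode(lower_bounds, width, freqs):
    """Estimate the mode of grouped data by interpolating inside the modal class."""
    # Modal class = class with the highest frequency
    i = max(range(len(freqs)), key=lambda k: freqs[k])
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                # class before the modal class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0   # class after the modal class
    denom = 2 * f1 - f0 - f2
    if denom == 0:
        raise ValueError("neighbouring classes tie with the modal class")
    return lower_bounds[i] + (f1 - f0) / denom * width


# Monthly rent table: class lower bounds (Rs), common width 500, frequencies
lower_bounds = [500, 1000, 1500, 2000, 2500, 3000]
freqs = [5, 10, 8, 16, 14, 12]

mode_rent = grouped_mode(lower_bounds, 500, freqs)
print(mode_rent)
```

Here the modal class is 2000-2500 (f = 16), so the open-ended last class does not affect the estimate.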
SYMMETRIC VS. SKEWED DATA
DATA SKEWED RIGHT
• When data is skewed to the right, the mean lies to the right of the median.
– Values spread out at the high end pull the mean upward.
• When data is skewed to the left, the opposite holds: the mean lies to the left of the median.
– Values spread out at the low end pull the mean downward.
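A common rule of thumb compares the mean with the median: a mean above the median suggests right skew, a mean below it suggests left skew. The sketch below applies that heuristic only; it is not a formal skewness statistic. The function name `skew_direction` and the second dataset are hypothetical, added for illustration.

```python
from statistics import mean, median


def skew_direction(values):
    """Heuristic skew check: compare the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"      # high values pull the mean above the median
    if m < med:
        return "left"       # low values pull the mean below the median
    return "roughly symmetric"


# Tree diameters from the lecture (mean 14.38, median 14.2)
trees = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]

# Hypothetical data with one very low value
low_tail = [1, 8, 9, 9, 10]

trees_skew = skew_direction(trees)
low_tail_skew = skew_direction(low_tail)
print(trees_skew, low_tail_skew)
```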
MEASURING THE DISPERSION OF DATA
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Mode Variance
Coefficient of Variation
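Variance and the coefficient of variation are only named on this slide. As a minimal sketch, assuming the sample variance (dividing by n - 1) and the coefficient of variation defined as standard deviation divided by mean, they can be computed for the Example I data as follows.

```python
from statistics import mean, stdev, variance

data = [8, 4, 2, 6, 10]

sample_variance = variance(data)   # sum of squared deviations / (n - 1)
sample_std = stdev(data)           # square root of the sample variance
cv = sample_std / mean(data)       # dispersion relative to the mean (unitless)

print(sample_variance, round(sample_std, 3), round(cv, 3))
```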
QUARTILES
Quartiles split the ranked data into 4 segments with an equal number of values per segment. The three cut points are Q1, Q2 and Q3.
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3
QUARTILES
Find a quartile by determining the value in the appropriate position in the ranked data,
where
First quartile position : Q1 at (n+1)/4
Second quartile position : Q2 at (n+1)/2 (median)
Third quartile position : Q3 at 3(n+1)/4
where n is the number of observed values
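The position rule can be tried with Python's `statistics.quantiles` (Python 3.8+), whose `"exclusive"` method places quartile i at position i(n+1)/4 in the sorted data and interpolates linearly when that position is not a whole number. The data below is the 11-value set used in the quartile examples of this lecture; treating fractional positions by interpolation is an assumption of this sketch, since other conventions (rounding, splitting into halves) also exist.

```python
from statistics import quantiles

data = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]   # n = 11
n = len(data)

# 1-based positions in the ranked data given by the rule
positions = [(n + 1) / 4, (n + 1) / 2, 3 * (n + 1) / 4]

# quantiles() sorts the data itself and returns the three cut points
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

print(positions)
print(q1, q2, q3)
```

With n = 11 all three positions are whole numbers (3, 6, 9), so the quartiles are simply the 3rd, 6th and 9th ranked values.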
INTERQUARTILE RANGE
The interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 - Q1.
Quartiles
Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10
Order the data:
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9
Inter-Quartile Range = 9 - 5½ = 3½
Quartiles
Example 2: Find the median and quartiles for the data below.
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10
Order the data:
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10
Inter-Quartile Range = 10 - 4 = 6
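The answers in Examples 1 and 2 are consistent with taking the median of the lower half and of the upper half of the ordered data (leaving out the middle value when n is odd). The sketch below assumes that convention; the function name `quartiles_by_halves` is illustrative. Other conventions, such as interpolating at position (n+1)/4, can give slightly different quartiles for some n.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, and the upper half."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value belongs to neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


# Example 1 (n = 12)
q1_a, q2_a, q3_a = quartiles_by_halves([12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10])
iqr_a = q3_a - q1_a

# Example 2 (n = 11)
q1_b, q2_b, q3_b = quartiles_by_halves([6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10])
iqr_b = q3_b - q1_b

print(q1_a, q2_a, q3_a, iqr_a)
print(q1_b, q2_b, q3_b, iqr_b)
```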
RANGE
BOXPLOT ANALYSIS
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower Quartile = 5½, Median = 8, Upper Quartile = 9
[Figure: box plot on an axis from 4 to 12]
Drawing a Box Plot
Lower Quartile = 4, Median = 8, Upper Quartile = 10
[Figure: box plot on an axis from 3 to 15]
Drawing a Box Plot
Question: Stuart recorded the heights in cm of boys in his
class as shown below. Draw a box plot for this data.
137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186
Lower Quartile (QL) = 158, Median (Q2) = 171, Upper Quartile (QU) = 180
[Figure: box plots of heights for boys and girls]
1. The boys are taller on average.
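Outliers – Sometimes there are extreme values that are separated from the rest of the data. These extreme values are called outliers. Outliers affect the mean.
The 1.5 IQR Rule for Outliers
A value is a suspected outlier if it lies more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3.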
Boxplots and outliers
Consider our NY travel times data. Construct a boxplot.
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
[Figure: boxplot of travel time on an axis from 0 to 90]
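The 1.5 IQR rule can be applied to the travel times as a rough sketch. It assumes quartiles are taken as medians of the lower and upper halves of the sorted data, and that a value is flagged when it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. The function name `iqr_outliers` is illustrative.

```python
from statistics import median


def iqr_outliers(values):
    """Return Q1, Q3, the (low, high) fences, and the values outside the fences."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    # For odd n the middle value is left out of the upper half
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return q1, q3, (low_fence, high_fence), outliers


travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, q3, fences, outliers = iqr_outliers(travel_times)
print(q1, q3, fences, outliers)
```

Under these assumptions only the 85-minute trip lies beyond the upper fence, so a boxplot would draw it as a separate point and end the upper whisker at 65.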
THANKS
Any questions?