Week 2
o Numeric: quantitative
─ Interval-scaled, Ratio-scaled
• Ratio
○ Inherent zero-point
○ We can speak of values as being an order of magnitude larger than
the unit of measurement (10 kelvins is twice as high as 5 kelvins).
─ E.g., the temperature in Kelvin, length, counts, monetary
quantities.
Attribute types:
• Ratio: absolute zero
• Interval: distance is meaningful
• Ordinal: attributes can be ordered
• Nominal: attributes are only named (weakest)
• Continuous attribute
○ Has real numbers as attribute values. E.g., temperature, height, or weight.
• We talked about the need for an absolute "0", i.e., a ratio attribute.
• We learned the difference between discrete and continuous attributes: discrete attributes take
integer values (for example, age) or binary values, whereas continuous attributes can take
on any value within a finite or infinite interval.
● Median:
N=3194
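L1 = 21 (lower boundary of the median interval)
Total observations N = 3194, so N/2 = 3194/2 = 1597
Sum of the frequencies below the median interval = 200 + 450 + 300 = 950
Freqmedian = 1500
width = 30
median = 21 + ((1597 − 950) / 1500) × 30 = 33.94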
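The arithmetic above can be checked with a short sketch. It assumes the usual linear-interpolation estimate for the median of grouped data; the function name `grouped_median` and its parameter names are illustrative and do not come from the slides.

```python
def grouped_median(lower_boundary, n_total, freq_below, freq_median, width):
    """Approximate the median of grouped data by interpolating inside the median interval.

    lower_boundary: lower boundary of the median interval (L1)
    n_total:        total number of observations (N)
    freq_below:     sum of the frequencies of all intervals below the median interval
    freq_median:    frequency of the median interval
    width:          width of the median interval
    """
    return lower_boundary + (n_total / 2 - freq_below) / freq_median * width


# Values taken from the worked example above.
median = grouped_median(21, 3194, 200 + 450 + 300, 1500, 30)
print(round(median, 2))
```

The estimate assumes observations are spread evenly inside the median interval, which is why it is only an approximation.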
This file is meant for personal use by [email protected] only.
Proprietary content. ©University of Arizona. All Rights Reserved. Unauthorized use or distribution prohibited.
Sharing or publishing the contents in part or full is liable for legal action.
Making sense of the estimation
Median, mean and mode of symmetric, positively and negatively skewed data.
• Quantile plot: each value x_i is paired with f_i, indicating that approximately f_i × 100% of
the data are ≤ x_i.
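A minimal sketch of how the (f_i, x_i) pairs of a quantile plot could be computed. The slides do not state how f_i is obtained; the formula f_i = (i − 0.5) / N used here is one common textbook convention and is an assumption, as is the helper name `quantile_pairs`.

```python
def quantile_pairs(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumption: f_i = (i - 0.5) / N for the i-th smallest value (1-based).
    """
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]


# Made-up data, for illustration only.
pairs = quantile_pairs([40, 10, 30, 20])
for f, x in pairs:
    print(f"about {f * 100:.1f}% of the data are <= {x}")
```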
• The two histograms shown may have the same boxplot representation: they share the same values
for min, Q1, median, Q3, and max, yet have rather different data distributions.
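The point can be illustrated with two small made-up datasets whose five-number summaries coincide even though their values are distributed differently. This is only a sketch: the data are invented, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which is one of several quartile conventions.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    data = sorted(values)
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, med, q3, data[-1])


# Invented data: one set is spread out, the other is clumped around Q1 and Q3.
spread_out = [0, 1, 2, 3, 5, 7, 8, 9, 10]
clumped = [0, 2, 2, 2, 5, 8, 8, 8, 10]

same_boxplot = five_number_summary(spread_out) == five_number_summary(clumped)
print(five_number_summary(spread_out), five_number_summary(clumped), same_boxplot)
```

A boxplot drawn from either list would look the same, while a histogram would reveal the clumping.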
● One sample t test
○ Degrees of freedom: n-1
○ Use: testing the difference of one sample mean, x-bar, with a given mean, μ
○ Assumptions: normal distribution; population standard deviation, σ, is unknown
● Two sample t test
○ Degrees of freedom: n1+n2-2
○ Use: testing the difference of two sample means when population variances are unknown but considered equal
○ Assumptions: normal distribution
Rejection area
● Two-tailed test
1. Null hypothesis: x̄ = μ
2. Alternate hypothesis: x̄ ≠ μ,
where μ is the hypothesized mean
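A minimal sketch of a two-tailed one sample t test using only the standard library. The sample values and the hypothesized mean are invented for illustration, and the function name `one_sample_t` is hypothetical. The critical value 2.262 is the standard two-tailed t-table entry for a 0.05 significance level with 9 degrees of freedom; it is hard-coded here instead of being computed.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for a one sample t test."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1


# Invented sample of 10 observations; hypothesized mean of 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.7, 5.2, 5.9]
t_stat, df = one_sample_t(sample, mu0=5.0)

T_CRIT = 2.262  # two-tailed critical value, alpha = 0.05, df = 9 (from a t table)
reject_null = abs(t_stat) > T_CRIT  # the rejection area covers both tails
print(t_stat, df, reject_null)
```

Because the test is two-tailed, the absolute value of the statistic is compared with the critical value, so large deviations in either direction lead to rejection.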
● Distance/similarity matrix
○ n data points, but registers only the distance/similarity
○ Is often a symmetric matrix
○ Single mode: (dis)similarity
Observations (cat1 and cat2) are described by nominal values of color, size, and sleep time.
Objects   Color    Size     Sleep time
cat1      yellow   small    <5 hours
cat2      yellow   medium   5-8 hours
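One way to turn the table into a number is the simple matching dissimilarity for nominal attributes, d(i, j) = (p − m) / p, where p is the number of attributes and m the number of matches. This formula is a standard choice but does not appear in the surviving slide text, so treat it as an assumption; `nominal_dissimilarity` is an illustrative name.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching dissimilarity: share of attributes whose values differ."""
    p = len(obj_i)  # total number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


# Values from the table above: (color, size, sleep time).
cat1 = ("yellow", "small", "<5 hours")
cat2 = ("yellow", "medium", "5-8 hours")

d = nominal_dissimilarity(cat1, cat2)
print(round(d, 3))
```

The two cats match only on color, so two of the three attributes differ.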
The number of attributes having the same/different values for the observations (e.g., cat1 and cat2 in the
previous table) is counted using a contingency table for binary attributes, as
shown below:
                 Object J
                 1      0      sum
Object I   1     q      r      q+r
           0     s      t      s+t
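The counts q, r, s, and t can be tallied directly from two binary vectors. The sketch below also derives two commonly used measures from them: the symmetric binary dissimilarity (r + s) / (q + r + s + t) and the asymmetric version (r + s) / (q + r + s), which ignores joint absences. These formulas are standard but are not in the surviving slide text, so they are assumptions here, as are the function names and the example vectors.

```python
def contingency_counts(obj_i, obj_j):
    """Count q (1,1), r (1,0), s (0,1), and t (0,0) over two binary vectors."""
    q = r = s = t = 0
    for a, b in zip(obj_i, obj_j):
        if a == 1 and b == 1:
            q += 1
        elif a == 1 and b == 0:
            r += 1
        elif a == 0 and b == 1:
            s += 1
        else:
            t += 1
    return q, r, s, t


def symmetric_dissimilarity(q, r, s, t):
    """Both outcomes (0 and 1) carry equal weight."""
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(q, r, s, t):
    """Joint absences (t) are ignored; assumes q + r + s > 0."""
    return (r + s) / (q + r + s)


# Invented binary objects, for illustration only.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
counts = contingency_counts(obj_i, obj_j)
print(counts, symmetric_dissimilarity(*counts), asymmetric_dissimilarity(*counts))
```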
Minkowski distance: d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h)^(1/h),
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the
order (the distance so defined is also called the L-h norm).
● Properties
○ When h → -∞, the expression tends to the minimum difference between any component (attribute) of the vectors.
Manhattan (L1)
Euclidean (L2)
Supremum
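The three named distances can be sketched as special cases of the L-h norm: h = 1 gives Manhattan, h = 2 gives Euclidean, and the supremum distance is the largest per-attribute difference (the limit as h → ∞). The function names and the two example points are illustrative assumptions.

```python
def minkowski(x, y, h):
    """L-h norm distance between two equal-length numeric vectors (h >= 1)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def supremum(x, y):
    """Supremum distance: the maximum difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))


# Invented 2-dimensional points.
x1, x2 = (1, 2), (3, 5)

manhattan = minkowski(x1, x2, 1)   # |1-3| + |2-5|
euclidean = minkowski(x1, x2, 2)   # sqrt(2^2 + 3^2)
sup = supremum(x1, x2)             # max(2, 3)
print(manhattan, euclidean, sup)
```

For any pair of points the supremum distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance.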
Normalization of numerical values
● Values measured on different scales cannot be compared directly.
Student A: SAT = 1800
Student B: ACT = 24
Which student performed better relative to other test-takers?
● Normalization is widely used with multi-dimensional datasets involving different scales: clustering,
multidimensional scaling, principal component analysis, etc.
                     SAT    ACT
Mean                 1500   21
Standard deviation   300    5
Where
Standardized measure:
● The mean absolute deviation (MAD) is more robust to outliers than the standard deviation because,
in the former, the differences from the mean are not squared.
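A sketch of the SAT/ACT comparison using the means and standard deviations from the table, assuming the standardized measure is the usual z-score, z = (x − mean) / standard deviation. It also computes a mean absolute deviation on invented data with one outlier, reading "MAD" as the mean absolute deviation, which is an interpretation based on the bullet's reference to differences from the mean. All function names are illustrative.

```python
import statistics


def z_score(x, mean, std):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mean) / std


def mean_absolute_deviation(values):
    """Average absolute difference from the mean (differences are not squared)."""
    m = statistics.mean(values)
    return sum(abs(v - m) for v in values) / len(values)


# Student A (SAT = 1800) and Student B (ACT = 24), with the table's parameters.
z_a = z_score(1800, mean=1500, std=300)
z_b = z_score(24, mean=21, std=5)
print(z_a, z_b)  # the larger z-score did better relative to other test-takers

# Invented data with a single outlier (100).
data = [1, 2, 3, 4, 100]
mad = mean_absolute_deviation(data)
std = statistics.pstdev(data)  # population standard deviation
print(mad, std)
```

Under these assumptions Student A sits one standard deviation above the SAT mean and Student B sits 0.6 standard deviations above the ACT mean, so Student A performed better relative to other test-takers.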
Proximity measure for ordinal variables
● Order is important, e.g., rank
Math grade: A, B, C, D, E.
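One common way to compute proximity for ordinal values is to replace each level by its rank r in 1..M and scale it to [0, 1] with z = (r − 1) / (M − 1), after which any numeric distance can be applied. This scaling is a standard textbook treatment but is not in the surviving slide text, so it is an assumption here. The rank order of the grades (A first) and the name `rank_scale` are also illustrative choices.

```python
def rank_scale(levels):
    """Map ordered levels to [0, 1] using z = (r - 1) / (M - 1) with 1-based rank r.

    Assumes at least two levels.
    """
    m = len(levels)
    # enumerate() is 0-based, so `index` already equals r - 1.
    return {level: index / (m - 1) for index, level in enumerate(levels)}


grades = ["A", "B", "C", "D", "E"]  # order taken from the example above
scaled = rank_scale(grades)

# Dissimilarity between two grades as the absolute difference of scaled ranks.
d_a_c = abs(scaled["A"] - scaled["C"])
print(scaled, d_a_c)
```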
• The angle between any two vectors (documents) can be used as a measure of the similarity between
the two documents:
• Cosine similarity is in [0, 1] for vectors with non-negative components. If d1 and d2 are two
vectors (e.g., term-frequency vectors), then cos(d1, d2) = (d1 ∙ d2) / (||d1|| ||d2||).
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 ∙ d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481*4.123) = 0.94
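The worked example can be reproduced with a few lines of standard-library Python. The function name `cosine_similarity` is illustrative; the vectors are the ones given above.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))
```

The function does not guard against an all-zero vector, which would make the denominator zero.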