Numerical Descriptors of Data


Seoul National University Instructor: Junho Song

Dept. of Civil and Environmental Engineering [email protected]

457.212 Statistics for Civil & Environmental Engineers


In-Class Material: Class 03
Numerical Descriptors of Data (A&T 1.2-1.3, Supp #1)

Partial descriptors, measures or descriptors to capture (estimated from data)


i) Central tendency: median, sample (s.) mean,
ii) Dispersion: range, IQR, mean absolute deviation, s. variance, s. standard dev., s.c.o.v.
iii) Asymmetry: skewness
iv) Linear dependence: s. covariance, s. correlation coeff.

1. Measure of Central Tendency

(a) Median ($x_{0.5}$): the middle value of the data set, ( )-percentile, ( )-quantile, ( )-quartile

N          odd                  even
median     $x_{(N+1)/2}$        $\frac{x_{N/2} + x_{N/2+1}}{2}$
example    {10, 29, 35}         {10, 29, 35, 49}
           $x_{0.5} =$          $x_{0.5} =$
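
The odd/even rule can be sketched in R as follows. This is only an illustration: the subscripts are taken to refer to the data sorted in ascending order, and `median_by_rank` is a hypothetical helper name, not a base R function.

```r
# Median from sorted data, following the odd/even rule.
# `median_by_rank` is a hypothetical helper, not part of base R.
median_by_rank <- function(x) {
  xs <- sort(x)                        # order the sample from smallest to largest
  N <- length(xs)
  if (N %% 2 == 1) {
    xs[(N + 1) / 2]                    # N odd: the single middle value
  } else {
    (xs[N / 2] + xs[N / 2 + 1]) / 2    # N even: average of the two middle values
  }
}

m_odd  <- median_by_rank(c(10, 29, 35))
m_even <- median_by_rank(c(10, 29, 35, 49))

# Both should agree with the built-in median()
m_odd  == median(c(10, 29, 35))
m_even == median(c(10, 29, 35, 49))
```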

(b) Sample mean ($\bar{x}$): the average of the sample values

$\bar{x} = (\quad) = \frac{1}{N} \sum_{i=1}^{N} x_i$

* Example 1: ( ) is less sensitive to “outliers” (extreme values) than ( ).

{1, 2, 3, …, 100, $10^6$}

$x_{0.5} =$
$\bar{x} =$

X1 = c(1:100,1000000)

median(X1) # quantile(X1, 0.5) should give the same result


mean(X1)

* Example 2: In the case of a multi-peak distribution, median and sample mean can be
significantly different.


Data Set (N = 2,001)                $x_{0.5}$    $\bar{x}$
{1, ……, 1, 25, 100, ……, 100}
{24, ……, 24, 25, 26, ……, 26}

X2 = c(array(1,1000),25,array(100,1000))
X3 = c(array(24,1000),25,array(26,1000))

mean(X2)
mean(X3)

median(X2)
median(X3)

2. Measure of Dispersion

(a) Range: r =

~ nondecreasing function of the sample ( ), therefore not stable


~ unduly affected by high and low values
~ e.g. range of golf driving distances for 100 and 1,000 hits

(b) IQR (Inter Quartile Range) =

~ more stable
~ spread of ( )% population at the center
~ generally, $(x_{1-q} - x_q)$ for small $q$ can be used as a measure of dispersion ($q = 0.25$ for IQR)

AddisonCreek = read.table("AddisonCreek.txt", header=TRUE)


FR = AddisonCreek$FlowRate

range_FR = diff(range(FR))
IQR_FR = IQR(FR, type=2)

# minimum and maximum


min(FR)
max(FR)
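
The two remarks above (the range can only grow as more data arrive, while the IQR is a difference of two quantiles) can be checked with a small sketch. The driving distances below are simulated, hypothetical values, not course data, and the seed is arbitrary.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
drives <- rnorm(1000, mean = 200, sd = 20)   # hypothetical driving distances

# Range of the first n observations, for n = 1, ..., 1000
running_range <- cummax(drives) - cummin(drives)
running_range[c(100, 1000)]                  # range after 100 and after 1,000 hits

# IQR written out as the difference of the 0.75- and 0.25-quantiles
q <- quantile(drives, c(0.25, 0.75), type = 2, names = FALSE)
iqr_by_hand <- q[2] - q[1]
all.equal(iqr_by_hand, IQR(drives, type = 2))
```

Here `type = 2` simply mirrors the quantile definition used in the `IQR(FR, type=2)` call above.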

How about using “the average of the deviations from the mean” as a measure of dispersion?

• Data set 1: {10, 20, 30, 40}


• Data set 2: {10, 10, 40, 40}

Question 1: Which data set has larger dispersion?

Question 2: What are the sample means?

Question 3: What is the average of the deviations for each data set?


Since “the average of the deviations” idea does not work …
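
A quick numerical check of why it fails, using the two data sets from the questions above: positive and negative deviations cancel, because the sum of $(x_i - \bar{x})$ equals $\sum x_i - N\bar{x} = 0$ for any sample. The variable names are chosen here only for illustration.

```r
X4 <- c(10, 20, 30, 40)   # Data set 1
X5 <- c(10, 10, 40, 40)   # Data set 2

# Average of the (signed) deviations from the mean
avg_dev_X4 <- mean(X4 - mean(X4))
avg_dev_X5 <- mean(X5 - mean(X5))
avg_dev_X4
avg_dev_X5
```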

(c) Mean Absolute Deviation ($d$): average of absolute deviations

$d = (\quad) = \frac{1}{N} \sum_{i=1}^{N} |x_i - \bar{x}|$

(d) Sample Variance ($s^2$): average of squared deviations

$s^2 = (\quad) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$

(e) Sample Standard Deviation ( s ): square root of sample variance

                                $d$      $s^2$      $s$
Data Set 1: {10, 20, 30, 40}
Data Set 2: {10, 10, 40, 40}

(f) “Unbiased” sample variance and standard deviations: divide by (N–1) instead of (N)

X4 = c(10,20,30,40)
X5 = c(10,10,40,40)

mad_X4 = mean(abs(X4-mean(X4)))
mad_X5 = mean(abs(X5-mean(X5)))

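var(X4)   # note: var() divides by (N-1), i.e. the "unbiased" form in (f)
var(X5)

sd(X4)    # sd() is sqrt(var()), so it also uses (N-1)
sd(X5)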
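
Because `var()` and `sd()` use the divisor (N-1), they do not reproduce definitions (d) and (e) directly. A minimal sketch of the divisor-N version, assuming a hypothetical helper named `var_N`:

```r
# Variance with divisor N, as in definition (d); `var_N` is a hypothetical helper
var_N <- function(x) mean((x - mean(x))^2)

X4 <- c(10, 20, 30, 40)
N  <- length(X4)

var_N(X4)                 # divides by N
var(X4)                   # base R divides by N - 1
all.equal(var(X4), var_N(X4) * N / (N - 1))
sqrt(var_N(X4))           # standard deviation with divisor N
```

The two versions differ only by the factor N/(N-1), which becomes negligible for large samples.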

Comparison of dispersion of data sets with different units or quantities? Consider unbiased sample variances of {1, 2, 3} and {2, 4, 6}.

We need a measure of dispersion that is not affected by “scaling” or “unit changes.”

(g) Sample Coefficient of Variation (C.O.V.; δ̂ )

δ̂ =

- dimensionless
- independent of ( ) or ( )


- useful for comparing ( ) of data sets with different magnitude or quantity


- does not work when $\bar{x}$ is close to ( )

Sample c.o.v. of {1, 2, 3} and {2, 4, 6}?

X6 = c(1,2,3)
X7 = c(2,4,6)
sd(X6)
sd(X7)
sd(X6)/abs(mean(X6))
sd(X7)/abs(mean(X7))
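
The effect of scaling can also be worked out by hand. The sketch below takes the c.o.v. to be $s/|\bar{x}|$, as in the R lines above, and uses the (N-1) form of the variance; the scale factor $c$ is a symbol introduced here for illustration.

```latex
\begin{align*}
y_i &= c\,x_i \quad (c \neq 0) \;\Rightarrow\; \bar{y} = c\,\bar{x}, \\
s_y^2 &= \frac{1}{N-1}\sum_{i=1}^{N} (c\,x_i - c\,\bar{x})^2 = c^2 s_x^2,
  \qquad s_y = |c|\,s_x, \\
\frac{s_y}{|\bar{y}|} &= \frac{|c|\,s_x}{|c|\,|\bar{x}|} = \frac{s_x}{|\bar{x}|}.
\end{align*}
```

For {1, 2, 3} and {2, 4, 6}, $c = 2$: the unbiased variances are 1 and 4, while both ratios $s/|\bar{x}|$ equal 0.5.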

3. How to install R packages

• Collections of functions and data sets developed by the community


• Can make R more powerful by improving existing base R functions, or by adding new ones

- Example : R package "e1071"

install.packages("e1071") # install packages


library(e1071) # load and attach add-on packages

4. Measure of Asymmetry

(a) Sample Coefficient of Skewness (θ̂ )

θ̂ =

- Symmetric distribution:
- Asymmetric distribution:

If positive: “positive skewness” or “skewed to the ( )”


If negative: “negative skewness” or “skewed to the ( )”

skewness(FR) # Compute the skewness coeff. using the function
             # skewness in "e1071" package
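
For intuition, a skewness coefficient can also be computed in base R without any package. The sketch below uses one common moment-based definition, $g_1 = m_3 / m_2^{3/2}$, with central moments taken with divisor N. This is an assumption: it may differ from the definition used in class, and from the default variant of `skewness()` in "e1071", by a factor that depends on N. `skew_g1` and the three small data sets are hypothetical.

```r
# Moment-based skewness, g1 = m3 / m2^(3/2); `skew_g1` is a hypothetical helper
skew_g1 <- function(x) {
  dev <- x - mean(x)
  m2 <- mean(dev^2)        # second central moment (divisor N)
  m3 <- mean(dev^3)        # third central moment (divisor N)
  m3 / m2^(3 / 2)
}

sym   <- c(1, 2, 3, 4, 5)      # symmetric about its mean
right <- c(1, 1, 1, 2, 10)     # long tail toward large values
left  <- -right                # mirror image: long tail toward small values

skew_g1(sym)
skew_g1(right)
skew_g1(left)
```

Comparing the signs of the three results with the shapes of the data sets is a useful check on the sign convention.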

5. Measure of Linear Dependence between Two Data Samples

Data given in pairs, i.e., $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, and interested in the dependence.

• “the larger $x_i$, the larger $y_i$”: ( ) linear dependence
• “the larger $x_i$, the smaller $y_i$”: ( ) linear dependence

Can be seen from “scatter plots.” Numerically?


(a) Sample Covariance

$s_{XY} = \frac{1}{N-1} (\quad)$

~ the sign tells us the trend, but not about the ( ) of the dependence

(b) Sample Correlation Coefficient: divide the sample covariance by the product of sample standard deviations

$r_{XY} =$

- dimensionless
- Bounded by ( ) and ( ): [ ] $\le r_{XY} \le$ [ ]
- $r_{XY} \approx -1$: strong ( ) linear dependence
- $r_{XY} \approx 1$: strong ( ) linear dependence
- $r_{XY} \approx 0$: no significant linear dependence

Sketches of scatter plots of these three cases?

HT = AddisonCreek$Height
cov(FR,HT)
cor(FR,HT)
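
The two descriptors can also be computed step by step and compared with `cov()` and `cor()`. The sketch below assumes the usual product-of-deviations form with divisor (N-1), which is what base R's `cov()` uses. The paired data are hypothetical values made up for illustration, not the Addison Creek data.

```r
# Hypothetical paired data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
N <- length(x)

# Sample covariance with divisor N - 1
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (N - 1)

# Sample correlation: covariance divided by the product of standard deviations
r_xy <- s_xy / (sd(x) * sd(y))

all.equal(s_xy, cov(x, y))
all.equal(r_xy, cor(x, y))
```

Rescaling `x` or `y` (for example, changing units) changes `s_xy` but leaves `r_xy` unchanged, which is why the correlation coefficient is the one used to judge the strength of dependence.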

6. Example 1: Computational simulations of steel structures under earthquake ground motions

Download the dataset ‘Kim_Collapse.txt’ from the eTL website (generated during Mr. Taeyong Kim’s PhD research).
Related reference: Deniz, D., J. Song, and J.F. Hajjar (2018). Energy-based sidesway collapse fragilities for ductile structural frames under earthquake loadings. Engineering Structures, Vol. 174, 282-294.

# Exercise 01: Scatter plot of Velocity Ratio (VR) and Drift Ratio (DR)

Kim = read.table("Kim_Collapse.txt", header=TRUE)  # header=TRUE so that columns can be accessed by name
VR = Kim$EquivalentVelocityRatio
DR = Kim$DriftRatio

plot(DR,VR)

# Exercise 02: Compare partial descriptors of two sets - median, mean,
# maximum, minimum, variance, standard deviation, and c.o.v.

median(VR); mean(VR); max(VR); min(VR); var(VR); sd(VR)
sd(VR)/abs(mean(VR))
median(DR); mean(DR); max(DR); min(DR); var(DR); sd(DR)
sd(DR)/abs(mean(DR))

# Exercise 03: Compare boxplots of DR and VR (before/after scaling by
# means)

boxplot(DR,VR); boxplot(DR/mean(DR),VR/mean(VR))


7. Example 2: Sample correlation coefficient between DO (dissolved oxygen) and BOD (biochemical oxygen demand) in water?
