ENVR 5320
Environmental Data Analysis
Lecture 1
Dr. Zhi NING
Division of Environment and Sustainability
The Hong Kong University of Science and Technology
Agenda
• Environmental problems and statistics
• Brief review of statistics
• Get familiar with the Excel tools
• Statistical distribution measures
• Probability distributions
Environmental Problems and Statistics
• The goal of statistics
– to make the discovery process efficient.
• Environmental laws and regulations:
– toxic chemicals;
– water quality criteria;
– air quality criteria.
• Environmental data
– the limit of detection;
– acute and chronic toxicity criteria;
– cancer potency factors.
Environmental Problems and Statistics
• Use statistical tools to understand nature
Structure of course teaching
• Introduction
– engineering problem and statistical method.
• Case study
– introduce a specific environmental example with
real world data
• Method
– give a brief explanation of statistical method that is
used to prepare the solution.
• Analysis
– show how the data suggest and influence the
method of analysis and give the solution.
Brief Review of Statistics
• Population and sample
– A population is a very large set of N observations
(or data values) from which the sample of n
observations can be imagined to have come.
• Two types of statistics:
– DESCRIPTIVE
• a way of summarizing the complexity of data with a
single number.
– INFERENTIAL
• answers the question, "To what extent can these findings
be GENERALIZED?"
Brief Review of Statistics
• DESCRIPTIVE Statistics
– For one variable ("univariate analysis"):
– Measures of "CENTRAL TENDENCY" (averages) and
of DISPERSION or variance around that average.
– Examples: means, modes, medians, standard
deviation, quartiles, etc.
– For multiple variables:
– The strength of relationship between two variables
(bivariate analysis) or among a set of variables
(multivariate analysis)
– Examples: correlation coefficient
Brief Review of Statistics
• INFERENTIAL Statistics
– Measures of the SIGNIFICANCE of the relationship
between two or more variables. Significance refers to
the probability that the findings could be attributed to
sampling error.
– Appropriate statistics depend on the LEVEL OF
MEASUREMENT OF THE DEPENDENT VARIABLE
(and of the independent variable).
– Example: t-Test, ANOVA (F-ratio)
Let’s Get Familiar with Advanced Excel Tools
• Formula in Excel
• Hidden Developer functions in Excel.
• Practice calculations in Excel data example
• Good practice in using Excel
Excel basics I
• Use of formula
• Use of $ (absolute cell references)
• Use of shortcut to go to cells
• Note the black and white cross
• Plot
• Use of Ctrl + Shift + Enter for array calculation
• Developer tool
• ActiveX
Statistical distribution measures
• Central values
– Arithmetic mean, Geometric mean
– Mode, Median
• Measures of spread
– The range
– The interquartile range (IQR)
– Standard deviation, variance
– Coefficient of variation (CV)
• Quartiles, Quantiles and percentiles
Statistical distribution measures
• Central values
– Arithmetic mean: AVERAGE(a,b,c)
– Geometric mean: GEOMEAN(a,b,c)
– Mode: value with highest probability of occurrence
– Median: central value of the ordered data, MEDIAN(a,b,c)
• Trimmed mean:
– e.g. the 5 percent trimmed mean is the average of the
data between the 5th and 95th percentiles
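The central values listed above can be reproduced outside Excel. The following is a minimal sketch using Python's standard `statistics` module; the data values are hypothetical, and the `trimmed_mean` helper is an illustrative assumption that trims a fixed fraction of the points from each end, which only approximates "the data between the 5th and 95th percentiles" for small samples.

```python
import statistics

# Hypothetical right-skewed data (e.g. concentrations in mg/L), for illustration only.
data = [1.0, 2.0, 2.0, 4.0, 8.0]

arithmetic_mean = statistics.mean(data)            # Excel: AVERAGE
geometric_mean = statistics.geometric_mean(data)   # Excel: GEOMEAN
median = statistics.median(data)                   # Excel: MEDIAN
mode = statistics.mode(data)                       # most frequent value


def trimmed_mean(values, fraction):
    """Average after dropping `fraction` of the points from each end (hypothetical helper)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)        # number of points cut from each tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


# With only five points, a 20 percent trim is used so that one point is cut per tail.
trimmed = trimmed_mean(data, 0.20)

print(arithmetic_mean, geometric_mean, median, mode, trimmed)
```

For this right-skewed sample the arithmetic mean is pulled upward by the single high value, while the geometric mean stays closer to the median.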
Statistical distribution measures
• Influence of the shape of the data distribution
• Right skewed
– Arithmetic mean is influenced by high values;
– G (geometric mean) is same as median
– G best represents
• "Heavy tails"
– The "heaviness" of the tails depends on degrees of
freedom (df)
– Higher df leads to normal dist.
• Bimodal distribution in nature
• The implications
Statistical distribution measures
• Measures of spread
– The range (MIN and MAX)
– The interquartile range (IQR)
Percentile (array, k)
Quartile (array, 0/1/2/3/4)
IQR = Q3 - Q1
Normalized IQR = 0.7413*(Q3-Q1), a robust estimate of σ for normal data
– The standard deviation
Statistical distribution measures
• Measures of spread
– Variance
VAR(array)
– Coefficient of variation (CV): standard deviation divided by the mean
Statistical distribution measures
• Measures of spread
– Quartiles, quantiles and percentiles
Quartile (array, 0/1/2/3/4)
Percentile (array, 0.05/0.10/0.95)
– Skewness:
• measure of symmetry of data distribution
Skew (array)
0 is symmetric; <0, left skewed; >0, right
skewed.
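As a cross-check on the spread measures above, here is a minimal Python sketch. The data are hypothetical. The `method="inclusive"` option is chosen on the assumption that it matches the interpolation used by Excel's QUARTILE, and `sample_skewness` is an illustrative helper written to follow the sample-skewness formula documented for Excel's SKEW.

```python
import statistics

# Hypothetical data with one high value, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]

data_range = max(data) - min(data)                  # MAX - MIN

# Quartiles by linear interpolation that includes the end points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                                       # interquartile range
normalized_iqr = 0.7413 * iqr                       # robust estimate of sigma

s = statistics.stdev(data)                          # sample standard deviation
variance = statistics.variance(data)                # sample variance, Excel: VAR
cv = s / statistics.mean(data)                      # coefficient of variation


def sample_skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)


skew = sample_skewness(data)                        # > 0 means right skewed

print(data_range, q1, q3, iqr, normalized_iqr, s, variance, cv, skew)
```

The single high value inflates the range and the standard deviation, while the IQR is unaffected by it.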
Statistical distribution measures
• Frequency distributions
– Identify cutting points to divide the data into
categories. The cutoff points should be chosen to
divide the data fairly evenly.
FREQUENCY (data_array, bins_array)
Press Ctrl + Shift + Enter
No.  Bin  Frequency
1    10   2
2    20   0
3    30   2
4    40   3
5    50   5
6    60   4
7    70   2
8    80   0
9    90   1
10   100  1
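The binning logic can be sketched in Python as follows. This is an illustration under stated assumptions, not Excel's implementation: each bin value is treated as an inclusive upper limit, one extra count is returned for values above the last bin, and the `frequency` helper and the data are hypothetical.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin; bins must be sorted ascending upper limits."""
    counts = [0] * (len(bins) + 1)          # last slot: values above the top bin
    for value in data:
        # Index of the first bin whose upper limit is >= value.
        counts[bisect_left(bins, value)] += 1
    return counts


bins = [10, 20, 30, 40, 50]
data = [3, 10, 11, 25, 30, 31, 38, 44, 47, 50, 62]   # hypothetical observations

counts = frequency(data, bins)
for upper, count in zip(bins + ["above"], counts):
    print(upper, count)
```

A value equal to a bin limit (10, 30 or 50 here) is counted in that bin, not in the next one.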
Statistical distribution measures
• Accuracy, Bias and Precision
– Bias measures systematic errors
– Precision measures the degree of scatter in the
data
– Accuracy is a function of both bias and precision.
A known concentration of 8.00 mg/L.
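To make the three terms concrete, the sketch below computes bias and precision for a set of replicate measurements of the known 8.00 mg/L concentration. The measurement values are hypothetical, and using the root-mean-square error as the combined accuracy measure is an assumption made for illustration.

```python
import statistics

true_value = 8.00                                  # known concentration, mg/L
measurements = [7.9, 8.3, 8.1, 8.4, 8.3]           # hypothetical replicate results

# Bias: systematic error, the mean result minus the true value.
bias = statistics.mean(measurements) - true_value

# Precision: scatter of the results around their own mean.
precision = statistics.stdev(measurements)

# Root-mean-square error about the true value combines both effects:
# rmse^2 = bias^2 + (population variance of the measurements).
rmse = (sum((m - true_value) ** 2 for m in measurements) / len(measurements)) ** 0.5

print(bias, precision, rmse)
```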
Probability distributions
• The Normal Distribution
– Often called Gaussian distribution
– Characterized completely by N(η, σ²), "a normal
distribution with mean η and variance σ²".
Read and type Greek letters correctly
• Alt 956 (μ)
• Alt 963 (σ)
• Alt 961 (ρ)
• Alt 960 (π)
https://www.thespruceeats.com/the-greek-alphabet-1705558
Probability distributions
• The Normal Distribution
1. The vertical axis (probability density) is scaled
such that area under the curve is unity (1.0).
2. The standard deviation σ measures the distance
from the mean to the point of inflection.
3. The probability that a positive deviation from the
mean will exceed one σ is 0.1587.
4. Because of symmetry, the probabilities are the
same for negative deviations
5. The chance that a deviation in either direction will
exceed 2σ is 2(0.0228) = 0.0456
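The tail probabilities quoted in points 3 to 5 can be checked with the standard normal distribution. This is a small sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)            # standard normal, eta = 0, sigma = 1

# Point 3: probability that a positive deviation exceeds one sigma.
p_above_1sigma = 1.0 - z.cdf(1.0)

# Point 4: by symmetry the same holds for negative deviations.
p_below_minus_1sigma = z.cdf(-1.0)

# Point 5: probability that a deviation in either direction exceeds two sigma.
p_outside_2sigma = 2.0 * (1.0 - z.cdf(2.0))

print(round(p_above_1sigma, 4), round(p_below_minus_1sigma, 4), round(p_outside_2sigma, 4))
```

The two-sided value comes out slightly below 0.0456 because the slide rounds the one-sided tail to 0.0228 before doubling.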
Probability distributions
• NORM.DIST(x, mean, standard_dev, cumulative)
– Returns the normal distribution at x for the specified η and σ: the cumulative
probability if cumulative is TRUE, the probability density if FALSE.
– With cumulative = TRUE, returns the left-tail probability α for given x, η and σ values.
• NORM.INV (probability, mean, standard_dev)
– Returns the inverse of the normal cumulative distribution for η and σ.
– Returns the x value for given α, η and σ values.
• NORM.S.DIST (z, cumulative)
– Returns the standard normal distribution at z, with η=0 and σ=1
• NORM.S.INV (probability)
– Returns the inverse of the standard normal cumulative distribution with η=0 and σ=1
• Cumulative or not?
• Left tailed or right tailed?
• How to generate a normal distribution in Excel?
Probability distributions
• Examples
– A normal distribution with η = 8 mg/L and σ = 1 mg/L;
– Which value has 95% of the data below it?
NORM.INV(0.95, 8, 1)
– What is the probability that a value reads
below 6.4 mg/L? NORM.DIST(6.4, 8, 1, TRUE)
– How to draw random values from a normal distribution in Excel?
– Use function: NORM.INV(RAND(), 8, 1)
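The same three questions can be worked through in Python as a cross-check on the Excel results. This is a minimal sketch with `statistics.NormalDist`; the seed and the number of random draws are arbitrary choices made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

dist = NormalDist(mu=8.0, sigma=1.0)          # eta = 8 mg/L, sigma = 1 mg/L

# Value with 95% of the data below it (inverse of the cumulative distribution).
x95 = dist.inv_cdf(0.95)

# Probability that a value reads below 6.4 mg/L (left-tail cumulative probability).
p_below = dist.cdf(6.4)

# Random draws by inverse transform: a uniform random number is passed through
# the inverse cumulative distribution. random.random() can in principle return
# exactly 0.0, which inv_cdf rejects; that case is ignored in this sketch.
random.seed(1)
samples = [dist.inv_cdf(random.random()) for _ in range(10_000)]

print(round(x95, 3), round(p_below, 4), round(mean(samples), 2), round(stdev(samples), 2))
```

The mean and standard deviation of the generated values should come out close to 8 and 1, which is a quick sanity check on any random-number recipe.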
Probability distributions
• t distribution
– In normal distributions, both η and σ are known;
– In practice, σ is often not known and we use the standard error
Se = s/√n in place of σ/√n: t = (ȳ − η) / Se
– Bell shaped and symmetric but tails are wider.
– Width of the t distribution depends on degree of
freedom.
– W. S. Gosset, a Guinness brewer, published it in 1908 under the pen name "Student"
Probability distributions
• Part of the t table as a function of degrees of freedom and tail area probability
Probability distributions
• T.INV (probability, degree of freedom)
– Returns the inverse of the left tailed Student t distribution
• T.INV.2T (probability, degree of freedom)
– Returns the inverse of the two tailed Student t distribution
• T.DIST (x, degree of freedom, cumulative)
– Returns the left tailed Student t distribution
• T.DIST.RT (x, degree of freedom)
– Returns the right tailed Student t distribution
• T.DIST.2T (x, degree of freedom)
– Returns the two tailed Student t distribution
• If we enter α as probability and n-1 as Deg_freedom, then T.INV.2T
outputs t(n-1, 1-α/2), the 100(1-α/2)th percentile of a t distribution with n-1
degrees of freedom.
Probability distributions
• Example
– What is the 97.5th percentile of a t distribution with
24 degrees of freedom?
– T.INV.2T(0.05, 24) = 2.064
OR -T.INV(0.025, 24)
– What is the probability of a t value larger than 2.064
in a t distribution with 24 degrees of freedom?
– T.DIST.RT(2.064, 24) = 0.025
(T.DIST.2T(2.064, 24) = 0.05 counts both tails)
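The Python standard library has no t distribution, so the sketch below integrates the t density numerically to reproduce the two answers. The helper names `t_pdf`, `t_cdf` and `t_inv`, the Simpson's rule step count and the bisection bounds are all assumptions made for illustration; a statistics package would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """Left-tail probability: Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps                              # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                       # area between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area


def t_inv(p, df):
    """Left-tail inverse by bisection on t_cdf (search limited to [-50, 50])."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


t_975 = t_inv(0.975, 24)                       # 97.5th percentile, df = 24
p_right = 1.0 - t_cdf(2.064, 24)               # P(t > 2.064), one tail
p_two_tailed = 2.0 * p_right                   # P(|t| > 2.064), both tails

print(round(t_975, 3), round(p_right, 4), round(p_two_tailed, 4))
```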
Distribution of average and variance
• Consider the sampling distribution of the
average when many random samples of size n
are collected from a population
• Sample standard deviation: s = √( Σ(yi − ȳ)² / (n − 1) )
• Standard error of the mean: σ/√n, estimated by s/√n
Distribution of average and variance
• Central limit effect:
– If the parent distribution from which the samples
come is normal, the distribution of the average is
normal.
– If the parent distribution is not normal, the
distribution of the average will be more nearly normal
than the parent distribution.
– With increasing sample size n, the distribution of the
average becomes increasingly normal.
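The central limit effect can be seen in a short simulation. The sketch below assumes a strongly right-skewed parent population (exponential with mean 1); the sample size, the number of replicate samples, the seed and the `skewness` helper are arbitrary illustrative choices.

```python
import random
import statistics

random.seed(0)


def skewness(values):
    """Moment-based skewness: mean of cubed standardized deviations."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)


n = 30                                             # size of each sample

# Draws from the non-normal parent distribution (mean 1, standard deviation 1).
parent = [random.expovariate(1.0) for _ in range(5000)]

# Averages of 2000 independent samples of size n from the same parent.
averages = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(2000)
]

print("skewness of parent:  ", round(skewness(parent), 2))
print("skewness of averages:", round(skewness(averages), 2))
print("sd of averages:      ", round(statistics.stdev(averages), 3))
print("sigma / sqrt(n):     ", round(1.0 / n ** 0.5, 3))
```

The averages are much less skewed than the parent data, and their standard deviation is close to σ/√n, the standard error of the mean.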
Distribution of average and variance
• How to estimate the t statistic?
– From a normal parent population to samples following a t
distribution with df = n-1: t = (ȳ − η) / (s/√n)
– The sample variance s² follows a scaled Chi-square
distribution: (n − 1)s²/σ² is Chi-square distributed with n − 1 degrees of freedom
Distribution of average and variance
• Example:
– From Sd to Se: Se = 0.266
– Normal distribution, N(8, 0.27): NORM.DIST(7.51, 8, 0.27, 1)
– t distribution, with t = -1.842 and df = 26: T.DIST(-1.842, 26, 1)
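The numbers in this example can be reproduced as follows. The sketch takes the average 7.51, the mean 8, the standard error 0.266 and df = 26 from the example; the `t_pdf` and `t_cdf` helpers are illustrative assumptions that integrate the t density numerically, since the Python standard library has no t distribution.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """Left-tail probability: Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


eta, y_bar, se, df = 8.0, 7.51, 0.266, 26

# t statistic: distance of the average from eta in standard-error units.
t_stat = (y_bar - eta) / se

# Left-tail probability if the average is treated as normal with sd 0.27.
p_normal = NormalDist(mu=eta, sigma=0.27).cdf(y_bar)    # like NORM.DIST(7.51, 8, 0.27, 1)

# Left-tail probability from the t distribution with 26 degrees of freedom.
p_t = t_cdf(t_stat, df)                                 # like T.DIST(-1.842, 26, 1)

print(round(t_stat, 3), round(p_normal, 4), round(p_t, 4))
```

The t-based probability is the larger of the two, reflecting the wider tails of the t distribution.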
Tutorial session