Chapter 1 Descriptive Statistics
BFC 34303
CIVIL ENGINEERING STATISTICS
Chapter 1
Descriptive Statistics
Faculty of Civil Engineering and Built Environment
Universiti Tun Hussein Onn Malaysia
What is ‘statistics’?
Statistics is the science that deals with collecting, classifying, presenting,
describing, analysing and interpreting data to enable us to draw
conclusions and make reasonable decisions.
16/10/2022
Descriptive statistics
The activity of collecting, classifying, presenting and describing
quantitative data.
Methods for organising (frequency table), representing (graphs) and
summarising data (central tendency and variability).
Inferential statistics
The part dealing with techniques and methods of interpretation of the
results obtained from the descriptive statistics.
Population
Population is the entire (complete) collection of data whose properties
are analysed.
It contains all the subjects of interest.
Can be of any size, its items need not be uniform but must share at least
one measurable feature.
Sample
A portion of the population selected for study.
A sample is any set of entities, cases, subjects, items or experimental
units chosen from the population.
Random Sample
A random sample is a sample selected in such a way that each element
of the population has the same chance of being selected.
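A minimal Python sketch of drawing a simple random sample. The population of 100 student ID numbers and the sample size of 10 are assumptions made only for illustration; `random.Random.sample` draws without replacement, so each element has the same chance of being selected.

```python
import random

# Hypothetical population: ID numbers of 100 students (not from the notes).
population = list(range(1, 101))

# A fixed seed makes the draw reproducible; drop it for a fresh sample each run.
rng = random.Random(42)

# Draw 10 distinct elements, each with the same chance of being selected.
sample = rng.sample(population, k=10)

print(sample)
```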
Parameter
Parameter is a numerical measurement describing some characteristics
of a population.
E.g. the population mean and variance.
Statistic
Statistic is a numerical measurement describing some characteristics of
a sample.
E.g. the sample mean and variance.
Variable
Any measured characteristic or attribute that differs for different
elements.
For example, if the weight of 30 people were measured, then weight
would be a variable.
Can be classified as quantitative or qualitative.
Quantitative Variable
The variable being studied is numeric and measured on an ordinal,
interval or ratio scale.
E.g. ambient temperature, vehicular speed and walking distance.
Qualitative Variable
The variable being studied is non-numeric and measured on a nominal or
ordinal scale.
Also called a ‘categorical’ variable.
E.g. gender, eye colour and educational level.
Data
A set of data is a collection of observations, measurements or
information obtained for a study.
It can be classified as qualitative data or quantitative data.
Ungrouped Data
Raw data that has not been grouped into class intervals.
The frequency distribution is arranged in order.
Example:
Weight of seven students: 56, 74, 68, 90, 52, 48, 65
Number of cars owned per household:
Grouped Data
Data is grouped according to class intervals before the frequency
distribution is assigned.
Example:
Height of students in a class:
Measures of Location
The measures of location are the mean, median, mode, quartile and percentile.
Measures of Central Tendency
Central tendency is a statistical measure that determines a single value
that accurately describes the center of the distribution and represents the
entire distribution of scores.
The goal is to identify the single value that is the best representative for
the entire set of data.
The measures of central tendency are the mean, mode and median.
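A short Python sketch of the three measures, using the standard `statistics` module. The weights are the seven student weights from the ungrouped-data example; the `cars_per_household` list is hypothetical and is included only because the weights contain no repeated value to serve as a mode.

```python
import statistics

# Weights of seven students, from the ungrouped-data example.
weights = [56, 74, 68, 90, 52, 48, 65]

mean_weight = statistics.mean(weights)      # sum of the values / number of values
median_weight = statistics.median(weights)  # middle value of the sorted data

# Hypothetical data with a repeated value, used only to illustrate the mode.
cars_per_household = [1, 2, 2, 0, 3, 2, 1]
mode_cars = statistics.mode(cars_per_household)  # most frequent value

print(mean_weight, median_weight, mode_cars)
```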
Quartiles
Quartiles are values that divide a data set into four parts containing an
approximately equal number of observations.
The total of 100% is split into four equal parts (four quarters):
Q1 Q2 Q3
Interquartile Range = Q3 – Q1
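As a sketch, the quartiles and interquartile range of the seven student weights can be computed with `statistics.quantiles`. Its default "exclusive" method is only one of several conventions for locating quartiles, so other textbooks or software may give slightly different values for the same data.

```python
import statistics

# Weights of seven students, from the ungrouped-data example.
weights = [56, 74, 68, 90, 52, 48, 65]

# n=4 returns the three cut points Q1, Q2, Q3 (the data are sorted internally).
# The default "exclusive" method places Q_k at position k(n + 1)/4.
q1, q2, q3 = statistics.quantiles(weights, n=4)

iqr = q3 - q1  # interquartile range

print(q1, q2, q3, iqr)  # 52.0 65.0 74.0 22.0
```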
Percentiles
Percentiles divide a set of data which are arranged in ascending order
into 100 equal parts.
A percentile is a measure used to indicate the value below which a given
percentage of observations in a group of observations fall.
For example, the 25th percentile is the value below which 25% of the
observations may be found.
Note:
25th percentile (P25) = First quartile (Q1)
50th percentile (P50) = Second quartile (Q2), which is also the median
75th percentile (P75) = Third quartile (Q3)
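The correspondence in the note can be checked numerically. This is a sketch using `statistics.quantiles` with its default "exclusive" method on the seven student weights; with so few observations the percentiles are interpolated, so it only illustrates the relationship.

```python
import statistics

# Weights of seven students, from the ungrouped-data example.
weights = [56, 74, 68, 90, 52, 48, 65]

# n=100 returns the 99 cut points P1..P99; index k-1 holds the k-th percentile.
percentiles = statistics.quantiles(weights, n=100)
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

# n=4 returns Q1, Q2, Q3 for comparison.
q1, q2, q3 = statistics.quantiles(weights, n=4)

print(p25 == q1, p50 == q2, p75 == q3)
```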
Measures of Dispersion
The measures of dispersion are the range, variance and standard deviation.
Measures of dispersion (or variation) describe how spread out a set of
data is, or the extent of the variability in individual items of the distribution.
Let us look at the following data sets to see how measures of central
tendency are different from measures of dispersion:
Most of the numbers in data set 1 are close to the mean value, while in
data set 2 the numbers are spread away from the mean. The difference in
the spread can be determined by a measure of dispersion.
The range is the difference between the largest and smallest observations.
However, it is not a good measure of dispersion because it is influenced by
the extreme values and its calculation does not cover all observations.
Variance and standard deviation are the most useful and widely used
measures of dispersion. Although they are influenced by the extreme
values, their calculations cover all the observations.
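A small Python sketch of this difference. The `altered` list is hypothetical: it is the seven student weights with one middle value changed, which leaves the range untouched but changes the standard deviation.

```python
import statistics

# Weights of seven students, from the ungrouped-data example.
weights = [56, 74, 68, 90, 52, 48, 65]

# Hypothetical variation: the middle value 65 is replaced by 60.
altered = [56, 74, 68, 90, 52, 48, 60]

# The range uses only the largest and smallest observations.
range_original = max(weights) - min(weights)
range_altered = max(altered) - min(altered)

# The standard deviation uses every observation.
sd_original = statistics.stdev(weights)
sd_altered = statistics.stdev(altered)

print(range_original, range_altered)  # 42 42
print(sd_original == sd_altered)      # False
```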
Variance
Variance (σ² or s²) is the average of the squared differences from the
mean.
Standard Deviation
Standard deviation (σ or s) is a measure of dispersion of observations
within a data set. It is simply the square root of the variance.
If the observations are all close to the mean, then the standard deviation
is close to zero.
If many observations are far from the mean, then the standard deviation
is far from zero.
If all the observations are equal, then the standard deviation is zero.
Variance, $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Standard deviation, $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
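The sample variance and standard deviation formulas can be applied step by step. This sketch uses the seven student weights from the ungrouped-data example; the variable names are illustrative only.

```python
import math

# Weights of seven students, from the ungrouped-data example.
weights = [56, 74, 68, 90, 52, 48, 65]

n = len(weights)
mean = sum(weights) / n

# Sum of the squared differences from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in weights) / (n - 1)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 2))  # 208.9 14.45
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and should agree with these values.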
Stem-and-Leaf Diagram
Stem Leaf
Distribution of Data
A symmetric (bell-shaped) curve is one in which the two sides of the
distribution would match each other exactly if the figure were folded over
its central point. This is called a normal distribution.
An example is shown below:
Positive skew
Negative skew
The distribution shows that most data are clustered at the right. The left
tail extends farther from the data centre than the right tail. Therefore, the
distribution is skewed to the left or negatively skewed.
Box-and-Whisker Plot
A box-and-whisker plot (also called a box plot) displays the five-number
summary of a set of data.
In a box plot, we draw a box from the first quartile to the third quartile. A
vertical line goes through the box at the median.
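A sketch of the five-number summary (min, Q1, Q2, Q3, max) that a box plot displays, computed for the seven student weights. The quartiles come from `statistics.quantiles` with its default "exclusive" method, which is an assumption; other quartile conventions can shift Q1 and Q3 slightly.

```python
import statistics

# Weights of seven students, from the ungrouped-data example.
weights = [56, 74, 68, 90, 52, 48, 65]

q1, q2, q3 = statistics.quantiles(weights, n=4)

# The box runs from Q1 to Q3 with a line at the median (Q2);
# the whiskers reach out to the minimum and maximum.
five_number_summary = (min(weights), q1, q2, q3, max(weights))

print(five_number_summary)  # (48, 52.0, 65.0, 74.0, 90)
```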
[Figure: horizontal and vertical box-and-whisker plots marking min, Q1, Q2, Q3 and max on a scale from 0 to 70]
[Figure: box-and-whisker plot marking min, Q1, Q2, Q3 and max on a scale from 10 to 100]
All the data lie within the lower and upper inner fences, so the data set has no outliers.
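The notes do not state how the inner fences are computed. The sketch below assumes the common convention of Q1 - 1.5 × IQR and Q3 + 1.5 × IQR; the function name `find_outliers` and the extra value 150 are hypothetical.

```python
import statistics

def find_outliers(data, k=1.5):
    """Return the values lying outside the inner fences.

    Assumes the common convention: fences at Q1 - k*IQR and Q3 + k*IQR, k = 1.5.
    """
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

# Weights of seven students, from the ungrouped-data example.
weights = [56, 74, 68, 90, 52, 48, 65]

print(find_outliers(weights))          # []    (all values inside the fences)
print(find_outliers(weights + [150]))  # [150] (hypothetical extreme value)
```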
[Figure: box-and-whisker plot marking min, Q1, Q2, Q3 and max on a scale from 10 to 100]
Measures of Location
The measures of location are the mean, median, mode, quartile, decile and
percentile.
Measures of Dispersion
The measures of dispersion are the range, interquartile range, variance and
standard deviation.
Formula
Mean, $\bar{x} = \frac{\sum fx}{\sum f}$
where x = data and f = frequency
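A sketch of the mean formula applied to grouped data, where x is taken as the class midpoint. The frequency table of heights is hypothetical and is not taken from the notes.

```python
# Hypothetical grouped data (heights in cm), not taken from the notes.
# Each entry is (lower class limit, upper class limit, frequency).
classes = [(150, 154, 3), (155, 159, 7), (160, 164, 12), (165, 169, 6), (170, 174, 2)]

# Sum of f: total number of observations.
sum_f = sum(f for _, _, f in classes)

# Sum of f*x, with x taken as the midpoint of each class.
sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)

mean = sum_fx / sum_f

print(mean)  # 161.5
```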
Mode $= L_m + \left( \frac{d_1}{d_1 + d_2} \right) c$
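The notes do not define the symbols in the mode formula. The sketch below assumes the usual reading: L_m is the lower boundary of the modal class, d1 and d2 are the differences between the modal class frequency and the frequencies of the classes before and after it, and c is the class size. The frequency table is hypothetical, and class boundaries are assumed to lie 0.5 below and above the whole-number class limits.

```python
# Hypothetical grouped data (heights in cm), not taken from the notes.
# Each entry is (lower class limit, upper class limit, frequency).
classes = [(150, 154, 3), (155, 159, 7), (160, 164, 12), (165, 169, 6), (170, 174, 2)]

freqs = [f for _, _, f in classes]
m = freqs.index(max(freqs))  # position of the modal class (highest frequency)

lower, upper, f_modal = classes[m]
L_m = lower - 0.5            # lower boundary of the modal class (assumed convention)
c = (upper + 0.5) - L_m      # class size

# Differences from the neighbouring class frequencies (0 if there is no neighbour).
d1 = f_modal - (freqs[m - 1] if m > 0 else 0)
d2 = f_modal - (freqs[m + 1] if m < len(freqs) - 1 else 0)

mode = L_m + d1 / (d1 + d2) * c

print(round(mode, 2))  # 161.77
```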
Median $= L_m + \left( \frac{\frac{n}{2} - F_L}{f_m} \right) c$
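A sketch of the median formula for grouped data, assuming the usual reading of the symbols: L_m is the lower boundary of the median class, F_L the cumulative frequency before it, f_m its frequency and c its size. The frequency table is hypothetical, and boundaries are assumed to lie 0.5 outside the whole-number class limits.

```python
# Hypothetical grouped data (heights in cm), not taken from the notes.
# Each entry is (lower class limit, upper class limit, frequency).
classes = [(150, 154, 3), (155, 159, 7), (160, 164, 12), (165, 169, 6), (170, 174, 2)]

n = sum(f for _, _, f in classes)

F_L = 0  # cumulative frequency of the classes before the median class
for lower, upper, f_m in classes:
    if F_L + f_m >= n / 2:       # first class whose cumulative frequency reaches n/2
        L_m = lower - 0.5        # lower boundary of the median class
        c = (upper + 0.5) - L_m  # class size
        break
    F_L += f_m

median = L_m + (n / 2 - F_L) / f_m * c

print(round(median, 2))  # 161.58
```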
Quartile, $Q_k = L_k + \left( \frac{\frac{k}{4} n - F_L}{f_k} \right) c_k$
Percentile, $P_k = L_k + \left( \frac{\frac{k}{100} n - F_L}{f_k} \right) c_k$
Decile, $D_k = L_k + \left( \frac{\frac{k}{10} n - F_L}{f_k} \right) c_k$
where k = 1, 2, 3, …
L_k = lower boundary of the class where Q_k, P_k or D_k lies
n = total number of observations
F_L = cumulative frequency of the class before the Q_k, P_k or D_k class
f_k = frequency of the class where Q_k, P_k or D_k lies
c_k = size of the class where Q_k, P_k or D_k lies
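The quartile, decile and percentile formulas differ only in the divisor (4, 10 or 100), so one function can cover all three. This is a sketch under stated assumptions: the function name `grouped_quantile`, its `parts` argument and the frequency table are hypothetical, and class boundaries are assumed to lie 0.5 outside the whole-number class limits.

```python
def grouped_quantile(classes, k, parts):
    """k-th quantile of grouped data split into `parts` equal parts.

    parts=4 gives quartiles, parts=10 deciles, parts=100 percentiles.
    `classes` is a list of (lower limit, upper limit, frequency) tuples.
    """
    n = sum(f for _, _, f in classes)
    target = k * n / parts  # the (k/parts) * n term of the formula
    F_L = 0                 # cumulative frequency before the quantile class
    for lower, upper, f_k in classes:
        if F_L + f_k >= target:
            L_k = lower - 0.5          # lower boundary of the quantile class
            c_k = (upper + 0.5) - L_k  # size of the quantile class
            return L_k + (target - F_L) / f_k * c_k
        F_L += f_k
    raise ValueError("k must satisfy 0 < k < parts")

# Hypothetical grouped data (heights in cm), not taken from the notes.
classes = [(150, 154, 3), (155, 159, 7), (160, 164, 12), (165, 169, 6), (170, 174, 2)]

q1 = grouped_quantile(classes, 1, 4)     # first quartile
q3 = grouped_quantile(classes, 3, 4)     # third quartile
d5 = grouped_quantile(classes, 5, 10)    # fifth decile
p50 = grouped_quantile(classes, 50, 100) # fiftieth percentile

print(round(q1, 2), round(q3, 2), round(d5, 2), round(p50, 2))
```

For this table D5 and P50 coincide with the grouped median, as expected.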
Variance, $s^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{\sum f}}{\sum f - 1}$
Standard Deviation, $s = \sqrt{\frac{\sum fx^2 - \frac{(\sum fx)^2}{\sum f}}{\sum f - 1}}$
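A sketch of the variance and standard deviation formulas for grouped data, with x taken as the class midpoint. The frequency table is hypothetical and is not taken from the notes.

```python
import math

# Hypothetical grouped data (heights in cm), not taken from the notes.
# Each entry is (lower class limit, upper class limit, frequency).
classes = [(150, 154, 3), (155, 159, 7), (160, 164, 12), (165, 169, 6), (170, 174, 2)]

# Pair each class midpoint x with its frequency f.
midpoints = [((lower + upper) / 2, f) for lower, upper, f in classes]

sum_f = sum(f for _, f in midpoints)
sum_fx = sum(f * x for x, f in midpoints)
sum_fx2 = sum(f * x ** 2 for x, f in midpoints)

# s^2 = (sum of f*x^2 - (sum of f*x)^2 / sum of f) / (sum of f - 1)
variance = (sum_fx2 - sum_fx ** 2 / sum_f) / (sum_f - 1)
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 2))  # 28.19 5.31
```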