Statistics
INTRODUCTION
DESCRIPTIVE STATISTICS
• Comprises those methods concerned with collecting
and describing a set of data so as to yield meaningful
information.
STATISTICAL INFERENCE
• Comprises those methods concerned with analysis of
a subset of data leading to predictions or inferences
about the entire set of data
POPULATION
• Consists of the totality of the observations with which we
are concerned
SAMPLE
• Subset of a population
SIMPLE RANDOM SAMPLE
• A simple random sample of n observations is a
sample that is chosen in such a way that every
subset of n observations of the population has
the same probability of being selected.
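A minimal sketch of drawing a simple random sample in Python. The population of 100 ID numbers, the sample size, and the seed are assumptions made for illustration, not values from the slides.

```python
import random

# Hypothetical population: member IDs 1..100 (not taken from the slides).
population = list(range(1, 101))
n = 10

rng = random.Random(42)              # fixed seed only to make the run repeatable
sample = rng.sample(population, n)   # n distinct members, drawn without replacement

print(sorted(sample))
```

Sampling without replacement in this way gives every subset of n members the same chance of being chosen, which is the defining property stated above.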
SYSTEMATIC SAMPLING
• Individuals are selected at regular intervals from
a list of the whole population. The intervals are
chosen to ensure an adequate sample size. For
example, every 10th member of the population
is included. This is often convenient and easy to
use, although it may also lead to bias.
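The "every 10th member" idea can be sketched as below. The list of 100 members is hypothetical, and the random starting position is an assumption added here (the slide only specifies the regular interval).

```python
import random

population = list(range(1, 101))   # hypothetical ordered list of 100 members
interval = 10                      # take every 10th member

rng = random.Random(0)             # fixed seed only to make the run repeatable
start = rng.randrange(interval)    # assumed: random starting index in 0..9
systematic_sample = population[start::interval]

print(systematic_sample)
```

If the list itself has a pattern with the same period as the interval, the selected members can all share that pattern, which is one way the bias mentioned above arises.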
STRATIFIED SAMPLING
• In this method, the population is first divided into sub-groups (or
strata) whose members share a similar characteristic. It is used when we might
reasonably expect the measurement of interest to vary between the
different sub-groups. Gender or smoking habits would be examples of
strata. The study sample is then obtained by taking samples from
each stratum.
• In a stratified sample, the probability of an individual being included
varies according to known characteristics, such as gender, and the aim
is to ensure that all sub-groups of the population that might be of
relevance to the study are adequately represented.
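A rough sketch of stratified sampling, using smoking habit as the stratum. The 200 people, the 1-in-4 smoker pattern, and the fixed quota of 10 per stratum are all invented for illustration.

```python
import random
from collections import defaultdict

# Hypothetical population: 200 people, every 4th one labelled a smoker.
people = [(pid, "smoker" if pid % 4 == 0 else "non-smoker") for pid in range(1, 201)]

# Step 1: divide the population into strata.
strata = defaultdict(list)
for pid, group in people:
    strata[group].append(pid)

# Step 2: take a separate random sample from each stratum.
rng = random.Random(1)             # fixed seed only to make the run repeatable
per_stratum = 10                   # assumed quota per stratum
stratified_sample = {
    group: rng.sample(members, per_stratum) for group, members in strata.items()
}

for group, chosen in stratified_sample.items():
    # Inclusion probability differs by stratum: 10/50 versus 10/150.
    print(group, len(strata[group]), len(chosen) / len(strata[group]))
```

With equal quotas and unequal stratum sizes, the inclusion probability depends on the stratum, matching the second bullet above.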
CLUSTERED SAMPLING
• In a clustered sample, sub-groups of the population are used
as the sampling unit, rather than individuals. The population
is divided into sub-groups, known as clusters, and a random
selection of these is included in the study.
All members of each selected cluster are then included in the study.
Clustering should be taken into account in the analysis.
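A small sketch of cluster sampling, where the sampling unit is the cluster rather than the individual. The eight "schools" and their members are hypothetical names made up for this example.

```python
import random

# Hypothetical clusters: 8 schools with 5 pupils each.
clusters = {
    f"school_{c}": [f"school_{c}_pupil_{i}" for i in range(1, 6)]
    for c in range(1, 9)
}

rng = random.Random(2)                          # fixed seed only for repeatability
chosen_clusters = rng.sample(sorted(clusters), 3)   # randomly select whole clusters

# Every member of each selected cluster enters the sample.
cluster_sample = [pupil for name in chosen_clusters for pupil in clusters[name]]

print(chosen_clusters)
print(len(cluster_sample))
```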
QUOTA SAMPLING
• This method of sampling is often used by market
researchers. Interviewers are given a quota of subjects of a
specified type to attempt to recruit. For example, an
interviewer might be told to go out and select 20 adult men
and 20 adult women, 10 teenage girls and 10 teenage boys
so that they could interview them about their television
viewing. There are several flaws with this method, but most
importantly it is not truly random.
CONVENIENCE SAMPLING
• Convenience sampling is perhaps the easiest method of
sampling, because participants are selected in the most
convenient way, and are often allowed to choose or volunteer
to take part. Good results can be obtained, but the data set
may be seriously biased, because those who volunteer to
take part may be different from those who choose not to.
SNOWBALL SAMPLING
• This method is commonly used in social sciences when
investigating hard-to-reach groups. Existing subjects are
asked to nominate further subjects known to them, so the
sample increases in size like a rolling snowball. For example,
when carrying out a survey of risk behaviors amongst
intravenous drug users, participants may be asked to
nominate other users to be interviewed.
STATISTICAL MEASURES OF DATA
PARAMETER
• Any numerical value describing a characteristic of a
population is called a parameter.
STATISTIC
• Any numerical value describing a characteristic of a
sample is called a statistic.
MEASURES OF CENTRAL LOCATION
POPULATION MEAN
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]
Example
• The number of employees at 5 different drugstores
are 3, 5, 6, 4, and 6. Treating the data as a population,
find the mean number of employees for the 5 stores.
\[ \mu = \frac{3 + 5 + 6 + 4 + 6}{5} = 4.8 \]
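The same calculation as a short Python check. The variable names are illustrative only.

```python
# Number of employees at the 5 drugstores, treated as a population.
employees = [3, 5, 6, 4, 6]

mu = sum(employees) / len(employees)   # population mean: sum of values over N
print(mu)  # 4.8
```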
SAMPLE MEAN
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Example
• A food inspector examined a random sample of 7 cans of
a certain brand of tuna to determine the percent of
foreign impurities. The following data were recorded:
1.8, 2.1, 1.7, 1.6, 0.9, 2.7, and 1.8. Compute the
sample mean.
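The slide leaves the computation open. One way to work it out is sketched below: the seven values sum to 12.6, so the sample mean is 12.6 / 7 = 1.8 percent.

```python
# Percent of foreign impurities in the 7 sampled cans.
impurities = [1.8, 2.1, 1.7, 1.6, 0.9, 2.7, 1.8]

x_bar = sum(impurities) / len(impurities)   # sample mean: sum of values over n
print(round(x_bar, 2))  # 1.8
```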
\[ \frac{\sum_{i=1}^{k} w_i \mu_i}{\sum_{i=1}^{k} w_i} \]
COMBINED MEAN
• Suppose that k finite populations having 𝑁1 , 𝑁2 , . . . , 𝑁𝑘
measurements, respectively, have means 𝜇1 , 𝜇2 , … , 𝜇𝑘 . The combined
population mean, 𝜇𝑐 , for all the populations is
\[ \mu_c = \frac{\sum_{i=1}^{k} N_i \mu_i}{\sum_{i=1}^{k} N_i} \]
\[ \mu_c = \frac{28(83) + 32(80) + 35(76)}{28 + 32 + 35} = 79.41 \]
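A short check of the combined mean. The sizes (28, 32, 35) and means (83, 80, 76) are read off the worked line above; what the three populations represent is not stated in the slides.

```python
sizes = [28, 32, 35]   # N_1, N_2, N_3
means = [83, 80, 76]   # mu_1, mu_2, mu_3

# Weight each population mean by its population size.
mu_c = sum(n * m for n, m in zip(sizes, means)) / sum(sizes)
print(round(mu_c, 2))  # 79.41
```

Note that the combined mean differs from the plain average of the three means (about 79.67), because the populations are of unequal size.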
GEOMETRIC MEAN
• The geometric mean, G, of k positive numbers 𝑥1 , 𝑥2 , . . . , 𝑥𝑘 is the kth
root of their product; that is,
\[ G = \sqrt[k]{x_1 x_2 \cdots x_k} \]
Example
• Find the geometric mean of 1, 4, and 128.
\[ G = \sqrt[3]{(1)(4)(128)} = 8 \]
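A quick check of this example in Python, computed directly from the definition and compared with the standard library's `statistics.geometric_mean` (available from Python 3.8). Results are rounded because floating-point roots are not exact.

```python
import math
import statistics

values = [1, 4, 128]

# kth root of the product of the k values.
g = math.prod(values) ** (1 / len(values))
print(round(g, 6))

# Cross-check with the standard library.
g_lib = statistics.geometric_mean(values)
print(round(g_lib, 6))
```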
HARMONIC MEAN
• The harmonic mean, H, of k numbers 𝑥1 , 𝑥2 , . . . , 𝑥𝑘 is
the number k divided by the sum of the reciprocals of
the k numbers; that is,
\[ H = \frac{k}{\sum_{i=1}^{k} \frac{1}{x_i}} \]
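The slides give no worked example for the harmonic mean. As an illustration only, the sketch below reuses the numbers 1, 4, and 128 from the geometric mean example; that choice of data is an assumption, not part of the original.

```python
import statistics

values = [1, 4, 128]   # borrowed from the geometric mean example for illustration

# k divided by the sum of the reciprocals of the k numbers.
h = len(values) / sum(1 / x for x in values)
print(round(h, 3))

# Cross-check with the standard library.
h_lib = statistics.harmonic_mean(values)
print(round(h_lib, 3))
```

For these values H is about 2.385, smaller than the geometric mean of 8, which is in turn smaller than the arithmetic mean.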
MEASURES OF VARIATION
• Range
• Variance
• Standard deviation
RANGE
• The range of a set of data is the difference
between the largest and smallest number in the
set.
• Range = Highest - Lowest
Example
• The IQs of 5 members of a family are 108, 112, 127,
118, and 113. Find the range.
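The range follows directly from the definition: 127 - 108 = 19. A short check:

```python
iqs = [108, 112, 127, 118, 113]

data_range = max(iqs) - min(iqs)   # highest minus lowest
print(data_range)  # 19
```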
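POPULATION VARIANCE
\[ \mu = \frac{7 + 5 + 9 + 7 + 8 + 6}{6} = 7 \]
\[ \sigma^2 = \frac{\sum_{i=1}^{6} (x_i - 7)^2}{6} \]
\[ \sigma^2 = \frac{(0)^2 + (-2)^2 + (2)^2 + (0)^2 + (1)^2 + (-1)^2}{6} \]
\[ \sigma^2 = \frac{5}{3} \]
\[ \sigma = \frac{\sqrt{15}}{3} = 1.29 \]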
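The arithmetic above can be reproduced as follows. The data values 7, 5, 9, 7, 8, 6 are read off the worked lines; the variable names are illustrative.

```python
import math

data = [7, 5, 9, 7, 8, 6]
N = len(data)

mu = sum(data) / N                                   # population mean
variance = sum((x - mu) ** 2 for x in data) / N      # divide by N for a population
sigma = math.sqrt(variance)                          # population standard deviation

print(mu)                  # 7.0
print(round(variance, 4))  # 1.6667
print(round(sigma, 2))     # 1.29
```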
SAMPLE VARIANCE
• Given a random sample 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 , the sample
variance is
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Example
• A comparison of coffee prices at 4 randomly selected
grocery stores in San Diego showed increases from
the previous month of 12, 15, 17, and 20 cents for a
200-gram jar. Find the variance of this random sample
of price increases.
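• Sample mean:
\[ \bar{x} = \frac{12 + 15 + 17 + 20}{4} = 16 \text{ cents} \]
• Sample variance:
\[ s^2 = \frac{\sum_{i=1}^{4} (x_i - 16)^2}{3} \]
\[ s^2 = \frac{(12 - 16)^2 + (15 - 16)^2 + (17 - 16)^2 + (20 - 16)^2}{3} \]
\[ s^2 = \frac{(-4)^2 + (-1)^2 + (1)^2 + (4)^2}{3} \]
\[ s^2 = \frac{34}{3} \]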
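A short check of the coffee-price example, computed from the definition and compared with `statistics.variance`, which also divides by n - 1.

```python
import statistics

increases = [12, 15, 17, 20]   # price increases in cents
n = len(increases)

x_bar = sum(increases) / n
s2 = sum((x - x_bar) ** 2 for x in increases) / (n - 1)   # divide by n - 1 for a sample

print(x_bar)          # 16.0
print(round(s2, 2))   # 11.33

# Cross-check with the standard library.
s2_lib = statistics.variance(increases)
```

The value 34/3 is about 11.33 (in squared cents).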
SAMPLE STANDARD DEVIATION
• Square root of the sample variance
CHEBYSHEV’S THEOREM
• At least the fraction \(1 - \frac{1}{k^2}\) of the measurements of any set of data
must lie within k standard deviations of the mean.
Example
• For k = 2, the theorem states that at least \(1 - \frac{1}{2^2} = \frac{3}{4}\), or 75%, of the
measurements must lie within 2 standard deviations on either side
of the mean. That is, ¾ or more of the observations of a population must lie
in the interval \(\mu \pm 2\sigma\).
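A sketch of the bound as a function, with an empirical check on a small data set. The helper names `chebyshev_bound` and `fraction_within` are made up for this example, and the six data values are borrowed from the population variance example purely for illustration.

```python
import math

def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed fraction of values within k population SDs of the mean."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum(1 for x in data if abs(x - mu) <= k * sigma) / n

data = [7, 5, 9, 7, 8, 6]          # illustrative data only
print(chebyshev_bound(2))          # 0.75
print(fraction_within(data, 2))    # 1.0
```

The observed fraction can never fall below the bound, but it is often much higher, since the theorem makes no assumption about the shape of the distribution.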
z SCORES
• Any observation, x, from a population with mean 𝜇 and standard
deviation 𝜎, has z score or z value defined by
\[ z = \frac{x - \mu}{\sigma} \]
Example
• Find the z scores corresponding to a student's grades in chemistry and
economics.
\[ \text{for chemistry: } z = \frac{82 - 68}{8} = 1.75 \]
\[ \text{for economics: } z = \frac{89 - 80}{6} = 1.50 \]
Interpretation:
• We see that the student had a grade in chemistry that
was 1.75 standard deviations above the mean of the
chemistry grades, whereas in economics she was only
1.50 standard deviations above the mean of the
economics grades. Comparing these two z scores, we
can now say that the student’s relative performance in
chemistry was higher than her performance in
economics.
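The comparison can be reproduced with a small helper. The function name `z_score` is illustrative; the grades, means, and standard deviations are the ones used in the worked lines above.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

z_chemistry = z_score(82, 68, 8)
z_economics = z_score(89, 80, 6)

print(z_chemistry)  # 1.75
print(z_economics)  # 1.5
print("chemistry" if z_chemistry > z_economics else "economics")  # chemistry
```

Although the raw economics grade (89) is higher, the chemistry grade is further above its class mean in standard-deviation units.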
MEAN DEVIATION
• The mean deviation of a sample of n observations is defined to be
\[ \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \]
• Find the mean deviation of the sample 2, 3, 5, 7, and 8
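\[ \text{for mean: } \bar{x} = \frac{2 + 3 + 5 + 7 + 8}{5} = 5 \]
\[ \text{for mean deviation: } \frac{|2 - 5| + |3 - 5| + |5 - 5| + |7 - 5| + |8 - 5|}{5} = \frac{10}{5} = 2 \]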
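A short check of this example. The absolute deviations from the mean are 3, 2, 0, 2, and 3, which sum to 10.

```python
sample = [2, 3, 5, 7, 8]
n = len(sample)

x_bar = sum(sample) / n
mean_deviation = sum(abs(x - x_bar) for x in sample) / n   # average absolute deviation

print(x_bar)           # 5.0
print(mean_deviation)  # 2.0
```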
COEFFICIENT OF VARIATION
• The standard deviation does not by itself tell us much about the
variability of a single set of data. Perhaps a more appropriate
measure is the coefficient of variation, defined by
\[ V = \frac{s}{\bar{x}} \times 100\% \quad \text{or} \quad V = \frac{\sigma}{\mu} \times 100\% \]
• This expresses the standard deviation as a percentage of the mean.
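The slides give no worked example for the coefficient of variation. As an illustration only, the sketch below applies the sample form of the formula to the coffee price increases (12, 15, 17, 20 cents) from the sample variance example.

```python
import statistics

increases = [12, 15, 17, 20]   # reused from the sample variance example

x_bar = statistics.mean(increases)
s = statistics.stdev(increases)        # sample standard deviation (n - 1 divisor)

cv_percent = s / x_bar * 100           # SD as a percentage of the mean
print(round(cv_percent, 2))  # 21.04
```

Because the result is unit-free, it can be compared across data sets measured in different units or on different scales.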
CHAPTER 3: STATISTICAL DESCRIPTION OF DATA
FREQUENCY DISTRIBUTION
• Important characteristics of a large mass of data can be readily
assessed by grouping the data into different classes and then
determining the number of observations that fall in each of the
classes. Such an arrangement, in tabular form, is called a frequency
distribution.
• Data that are represented in the form of a frequency distribution are
called grouped data.
Frequency Distribution for the Weights of 50 Pieces of Luggage
Weight (kilograms)    Number of Pieces
7 – 9                 2
10 – 12               8
13 – 15               14
16 – 18               19
19 – 21               7
Class limits: for the interval 10 – 12, the smaller number, 10, is the lower class limit, and the larger
number, 12, is the upper class limit.
Class boundaries: for the interval 10 – 12, 9.5 is the lower class boundary and 12.5 is the upper class boundary.
Class frequency: the number of observations falling in a particular class.
Class width: the difference between the upper and lower class boundaries of a class interval.
Class mark or class midpoint: the midpoint between the upper and lower class boundaries.
Class Interval   Class Boundaries   Class Mark, x   Frequency, f   Cumulative Frequency, cf
7 – 9            6.5 – 9.5          8               2              2
10 – 12          9.5 – 12.5         11              8              10
13 – 15          12.5 – 15.5        14              14             24
16 – 18          15.5 – 18.5        17              19             43
19 – 21          18.5 – 21.5        20              7              50
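The columns of this table can be derived from the class limits and frequencies alone. The sketch below assumes the weights were recorded to the nearest kilogram, so each boundary sits half a unit beyond the corresponding limit; that assumption matches the boundaries shown in the table.

```python
from itertools import accumulate

class_limits = [(7, 9), (10, 12), (13, 15), (16, 18), (19, 21)]
frequencies = [2, 8, 14, 19, 7]

rows = []
for (lower, upper), f, cf in zip(class_limits, frequencies, accumulate(frequencies)):
    lower_boundary = lower - 0.5                  # half a unit below the lower limit
    upper_boundary = upper + 0.5                  # half a unit above the upper limit
    class_mark = (lower_boundary + upper_boundary) / 2
    rows.append((lower, upper, lower_boundary, upper_boundary, class_mark, f, cf))

for row in rows:
    print(row)
```

The class width is the same for every row here: 9.5 - 6.5 = 3.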
Graphical Representations
• Bar chart • Histogram
• Although the bar chart provides immediate
information about a set of data in a condensed form,
we are usually more interested in a related pictorial
representation called a histogram. A histogram differs
from a bar chart in that the bases of the bars are the
class boundaries rather than the class limits. The use
of class boundaries for the bases eliminates the
spaces between the bars to give a solid appearance.
Frequency polygon
• Constructed by plotting the class frequencies against
class marks and connecting the consecutive points by
straight lines.
Cumulative Frequency Polygon or Ogive
• Obtained by plotting the cumulative frequency less
than any upper class boundary against the upper class
boundary and joining all the consecutive points by
straight lines
SYMMETRY AND SKEWNESS
• A distribution is said to be symmetric if it can be folded
along a vertical axis so that the two sides coincide.
• A distribution that lacks symmetry with respect to a
vertical axis is said to be skewed.
• Positively skewed: skewed to the right; it has a long right
tail compared to a much shorter left tail.
• Negatively skewed: skewed to the left; it has a long left
tail compared to a much shorter right tail.
• For a perfectly symmetrical distribution the
mean and median are identical and the value of
SK is zero (bell-shaped)
• Skewed to the left, the mean is less than the
median and the value of SK will be negative
• Skewed to the right, the mean is greater than
the median and the value of SK will be positive
EMPIRICAL RULE
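• Given a bell-shaped distribution of measurements, approximately
68% of the measurements lie within 1 SD of the mean
95% of the measurements lie within 2 SD of the mean
99.7% of the measurements lie within 3 SD of the mean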
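These percentages can be checked against the normal distribution, taking "bell-shaped" to mean normal (an assumption made for this sketch). It uses `statistics.NormalDist`, available from Python 3.8.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(k, round(p * 100, 1))
```

The rounded results are the familiar 68%, 95%, and 99.7% figures. They are much tighter than Chebyshev's bounds because they rely on the shape of the distribution.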
FRACTILES OR QUANTILES
\[ \text{hence, } P_{85} = \frac{4.1 + 4.2}{2} = 4.15 \]
1.6   2.6   3.1   3.2   3.4   3.7   3.9   4.3
1.9   2.9   3.1   3.3   3.4   3.7   3.9   4.4
2.2   3.0   3.1   3.3   3.5   3.7   4.1   4.5
2.5   3.0   3.2   3.3   3.5   3.8   4.1   4.7
2.6   3.1   3.2   3.4   3.6   3.8   4.2   4.7
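The value of P85 can be recovered from these 40 observations. The sketch assumes the convention implied by the worked line: 85% of 40 is 34, a whole number, so the percentile is taken as the average of the 34th and 35th ordered values.

```python
data = [
    1.6, 2.6, 3.1, 3.2, 3.4, 3.7, 3.9, 4.3,
    1.9, 2.9, 3.1, 3.3, 3.4, 3.7, 3.9, 4.4,
    2.2, 3.0, 3.1, 3.3, 3.5, 3.7, 4.1, 4.5,
    2.5, 3.0, 3.2, 3.3, 3.5, 3.8, 4.1, 4.7,
    2.6, 3.1, 3.2, 3.4, 3.6, 3.8, 4.2, 4.7,
]

ordered = sorted(data)
position = 85 * len(ordered) // 100      # 34 (85% of 40 is a whole number here)

# Average the 34th and 35th ordered values (list indices 33 and 34).
p85 = (ordered[position - 1] + ordered[position]) / 2
print(round(p85, 2))  # 4.15
```

Other percentile conventions exist (for example, linear interpolation between ranks) and can give slightly different answers on the same data.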
\[ P_{48} = 2.95 + 0.5 \left( \frac{19.2 - 7}{15} \right) = 3.36 \]
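The P48 line appears to use linear interpolation within a class of the grouped data: lower boundary 2.95, class width 0.5, target rank 48% of 40 = 19.2, cumulative frequency 7 below the class, and class frequency 15. The sketch below implements that reading; the function name `grouped_percentile` is made up here, and the boundaries and frequencies are those of the frequency table for the same 40 values.

```python
lower_boundaries = [1.45, 1.95, 2.45, 2.95, 3.45, 3.95, 4.45]
frequencies = [2, 1, 4, 15, 10, 5, 3]
class_width = 0.5

def grouped_percentile(p):
    """Interpolated p-th percentile from the grouped frequency table."""
    total = sum(frequencies)
    target = p * total / 100                 # rank to locate, e.g. 19.2 for p = 48
    cumulative = 0                           # cumulative frequency below the class
    for lower, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= target:
            return lower + class_width * (target - cumulative) / f
        cumulative += f
    raise ValueError("p must be between 0 and 100")

p48 = grouped_percentile(48)
print(round(p48, 2))  # 3.36
```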
Mean of Grouped Data
\[ \text{Mean} = \frac{\sum_{i=1}^{k} f_i x_i}{N} \]
\(f_i\): class frequency
\(x_i\): class mark
\(N\): number of observations
Class Interval   Class Boundaries   Class Midpoint   Frequency, f   Cumulative Frequency, cf
1.5 – 1.9        1.45 – 1.95        1.7              2              2
2.0 – 2.4        1.95 – 2.45        2.2              1              3
2.5 – 2.9        2.45 – 2.95        2.7              4              7
3.0 – 3.4        2.95 – 3.45        3.2              15             22
3.5 – 3.9        3.45 – 3.95        3.7              10             32
4.0 – 4.4        3.95 – 4.45        4.2              5              37
4.5 – 4.9        4.45 – 4.95        4.7              3              40
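Applying the grouped-mean formula to this table gives a sum of f times x equal to 136.5, so the mean is 136.5 / 40, roughly 3.41. The slides do not show this calculation; the sketch below is one way to carry it out.

```python
class_marks = [1.7, 2.2, 2.7, 3.2, 3.7, 4.2, 4.7]
frequencies = [2, 1, 4, 15, 10, 5, 3]

N = sum(frequencies)                                           # 40 observations
weighted_sum = sum(f * x for f, x in zip(frequencies, class_marks))
grouped_mean = weighted_sum / N

print(N)                       # 40
print(round(grouped_mean, 2))  # 3.41
```

The grouped mean treats every observation in a class as if it sat at the class mark, so it only approximates the mean of the raw data.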