Intreb Statist
You will learn here to organize and describe data sets. The goal is to make
the data easier to understand by describing trends, patterns or special
characteristics.
On the horizontal axis of the graph, we specify the labels that are used
for each of the classes. Then, using a bar of fixed width drawn above each label,
we extend the height of the bar until we reach the frequency, relative
frequency, or percent frequency of the label as indicated by the vertical axis.
The bars are separated to emphasize the fact that the labels are different
categories.
It is also recommended that the width be the same for each class. To choose it,
find the range of the data set, then divide the range by the number of classes
and round up to find the class width.
In practice, the number of classes and the appropriate class width are
determined by trial and error.
Class limits must be chosen so that each data value belongs to one and
only one class. Having the lower limit of the first class, you can find the lower
limit of the second class by adding the class width to the lower limit of the first
class. The upper limit of the first class will be one less than the lower limit of
the second class. The limits of the rest of the classes are determined similarly.
Thus far, we have seen tabular and graphical methods that are used to
summarize the data for one variable at a time. Often managers and decision
makers are interested in tabular and graphical methods that will help in
understanding the relationship between two variables. Two such methods are the
cross-tabulation and the scatter diagram.
Consider a data set for a single variable, and containing n items, or data
values. We will define numerical measures of data location and dispersion. If the
measures are computed for data from a sample, they are called sample statistics,
and if the measures are computed for data from a population, they are called
population parameters.
Population Mean: μ = Σxᵢ / N.    Sample Mean: x̄ = Σxᵢ / n,
where Σxᵢ represents the sum of the data values in the data set, N is the
population size, and n is the sample size.
The median of a data set is the value that lies in the middle of the data when
the data set is ordered. If the data set has an odd number of entries, the median is the
middle data entry. If the data set has an even number of entries, the median is the
mean of the two middle data entries.
Although the mean is the more commonly used measure of central location, in
some situations the median is preferred. The mean is influenced by extremely small or
large data values, so whenever there are extreme data values, the median is often the
preferred measure of central location.
The mode is the data value that occurs with the greatest frequency. When the
greatest frequency occurs at two or more different values, the data are said to be
bimodal or multimodal, respectively.
The mode is an important measure of location for qualitative data, since for their
case it makes no sense to speak of the mean or median.
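The three measures of central location can be computed with Python's standard `statistics` module; the data below are an illustrative sample of our own, chosen to show how an extreme value pulls the mean but not the median:

```python
import statistics

# Measures of central location for a small illustrative sample.
scores = [70, 75, 75, 80, 85, 90, 250]   # 250 is an extreme value

mean = statistics.mean(scores)       # pulled upward by the extreme value
median = statistics.median(scores)   # middle entry of the ordered data
mode = statistics.mode(scores)       # most frequent entry
```

Here the median (80) and mode (75) stay near the bulk of the data, while the mean exceeds 100.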
As we have seen, the mean can be greatly affected when the data set contains
outliers. An outlier is a data entry that is far removed from the other entries in the
data set.
Fractiles are used to specify the position of a data entry within a data set.
Fractiles are numbers that divide an ordered data set into equal parts.
For instance, the median is a fractile because it divides an ordered data set into
two equal parts.
Percentiles are often used in education and health-related fields to indicate how
one individual compares with others in the group. For instance, test scores are often
expressed in percentiles. Scores in the 95th percentile and above are unusually high,
while those in the 5th percentile and below are unusually low.
To find the pth percentile for a given data set, we will perform the next
operations.
Step 1. Order the data in ascending order.
Step 2. Compute the index i = (p/100)·n, where n is the number of data values.
Step 3. (a) If i is not an integer, then the next integer greater than i
indicates the position of the pth percentile. (b) If i is an integer, the pth
percentile is the average of the data values in positions i and i + 1.
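The textbook percentile procedure can be written directly in Python. This is a sketch of that specific convention; statistical software often uses slightly different percentile definitions, so results may differ from other tools:

```python
import math

def percentile(data, p):
    """pth percentile: order the data, compute i = (p/100)*n; round up
    when i is not an integer, otherwise average positions i and i+1."""
    values = sorted(data)
    n = len(values)
    i = (p / 100) * n
    if i != int(i):
        return values[math.ceil(i) - 1]       # position ceil(i), 1-based
    k = int(i)
    return (values[k - 1] + values[k]) / 2    # average of positions i and i+1

data = [15, 20, 25, 25, 27, 28, 30, 34]      # n = 8
# 50th percentile: i = 4 is an integer, so average the 4th and 5th values.
```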
The interquartile range (IQR) of a data set is the range for the middle 50% of
the data. It is the difference between the third and first quartiles:
IQR = Q3 − Q1.
* Box Plot.
The interquartile range is used to represent data sets by box plots. The box
plot is a data analysis tool that highlights the important features of a data set. To
graph a box plot, you must know the following values:
1. The minimum entry
2. The first quartile Q1
3. The median Q2
4. The third quartile Q3
5. The maximum entry
These five numbers are called the five-number summary of the data. Now, to
draw the box plot perform the following steps:
1. Find the five-number summary of the data set. 2. Draw a horizontal scale.
3. Plot the five numbers above the horizontal scale. 4. Draw a box from Q1 to Q3 and
draw a vertical line in the box at Q2. 5. Compute the lower limit according to the
formula: lower limit = Q1 − 1.5(IQR). 6. Compute the upper limit according to the
formula: upper limit = Q3 + 1.5(IQR). 7. Data outside these limits are considered
outliers. 8. Draw the whiskers from the box to the lower and upper limits and,
finally, show the outliers.
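The numerical part of the box-plot procedure (the five-number summary and the 1.5·IQR outlier fences) can be sketched as follows. Quartiles are computed here as medians of the lower and upper halves, which is one of several common conventions:

```python
def five_number_summary(data):
    """Five-number summary and outliers beyond the 1.5*IQR fences."""
    values = sorted(data)
    n = len(values)

    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    lower_half = values[: n // 2]
    upper_half = values[(n + 1) // 2 :]
    q1, q2, q3 = median(lower_half), median(values), median(upper_half)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr     # step 5: lower fence
    upper_limit = q3 + 1.5 * iqr     # step 6: upper fence
    outliers = [x for x in values if x < lower_limit or x > upper_limit]
    return (values[0], q1, q2, q3, values[-1]), outliers

summary, outliers = five_number_summary([3, 5, 7, 8, 9, 11, 13, 14, 40])
```

For these illustrative data the value 40 falls outside the upper fence and is flagged as an outlier.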
As we know, the range of a data set is the difference between the maximum and
minimum data entries in the set. This is a measure of variation that uses only two
entries from the data set. There are two measures of variability that use all the entries
in a data set. These are variance and standard deviation.
The variance is based on the difference between each data value and the mean.
The difference between a data value and the mean (xᵢ − x̄ for a sample; xᵢ − μ for a
population) is called a deviation about the mean. In the calculation of the
variance, the deviations about the mean are squared. The average of the squared
deviations is called the variance.
Population Variance: σ² = Σ(xᵢ − μ)² / N, where N is the population size.
Sample Variance: s² = Σ(xᵢ − x̄)² / (n − 1), where n is the sample size.
Definition. The standard deviation is the positive square root of the variance.
Unlike variance, the standard deviation is measured in the same units as the
original data. For this reason, the standard deviation is more easily compared to the
mean and other statistics that are measured in the same units as the original data.
The standard deviation measures the spread about the mean of a distribution. It
shows a typical deviation from the mean. The more the entries are spread out, the
greater the standard deviation.
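Both measures of variability are available in Python's `statistics` module, which distinguishes the sample formulas (divide by n − 1) from the population formulas (divide by N); the data are illustrative:

```python
import statistics

# Dispersion measures for a small illustrative data set.
data = [4, 7, 8, 9, 12]

s2 = statistics.variance(data)        # sample variance, divides by n - 1
s = statistics.stdev(data)            # sample standard deviation, sqrt(s2)
sigma2 = statistics.pvariance(data)   # population variance, divides by N
```

Note that the sample variance (34/4 = 8.5) is larger than the population variance (34/5 = 6.8) for the same numbers, because of the n − 1 divisor.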
11. Empirical rule and Chebyshev's theorem. Examples.
Empirical Rule. For data sets with a bell-shaped (approximately normal) distribution:
1. About 68% of the data lie within one standard deviation of the mean.
2. About 95% of the data lie within two standard deviations of the mean.
3. About 99.7% of the data lie within three standard deviations of the mean.
Data values that lie more than two standard deviations from the mean are
considered unusual. Data values that lie more than three standard deviations from the
mean are considered very unusual.
Chebyshev's Theorem. At least 1 − 1/k² of any data set lies within k standard
deviations of the mean, for any k > 1. For example:
k = 2: At least 1 − 1/4 = 3/4, or 75%, of the data lies within 2 standard deviations
of the mean.
k = 3: At least 1 − 1/9 = 8/9, or about 89%, of the data lies within 3 standard
deviations of the mean.
k = 4: At least 1 − 1/16 = 15/16, or about 94%, of the data lies within 4 standard
deviations of the mean.
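Chebyshev's bound is easy to verify numerically. The sketch below compares the guaranteed minimum fraction with the fraction actually observed in an illustrative data set (the bound holds for any data, bell-shaped or not):

```python
def chebyshev_bound(k):
    """Minimum fraction of any data set within k standard deviations
    of the mean, by Chebyshev's theorem (valid for k > 1)."""
    return 1 - 1 / k**2

def fraction_within(data, k):
    """Observed fraction of the data within k standard deviations of
    the mean (population standard deviation is used here)."""
    n = len(data)
    mean = sum(data) / n
    sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
    return sum(abs(x - mean) <= k * sd for x in data) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]
# The observed fraction is never below the Chebyshev bound.
```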
One additional measure (statistic) is used when we are interested in how large the
standard deviation is relative to the mean. This measure is called the coefficient of
variation and is calculated as follows:
Coefficient of variation = (Standard Deviation / Mean) × 100%.
For a sample of n pairs of values (xᵢ, yᵢ), the sample covariance, s_xy, is defined
by the following equation:
s_xy = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1).
The population covariance, σ_xy, is defined respectively:
σ_xy = Σ(xᵢ − μx)(yᵢ − μy) / N.
A positive covariance indicates a positive linear relationship between these two
variables, and a negative covariance indicates a negative linear relationship.
Finally, if the points are evenly distributed over all quadrants of the scatter
diagram, the value of the covariance is close to zero. However, the value we obtain
for the covariance depends on the units of measurement for x and y. A measure of the
relationship that is free of this drawback is the correlation coefficient, defined as
follows:
r_xy = s_xy / (s_x · s_y), where s_xy is the sample covariance, and s_x, s_y are the
sample standard deviations;
ρ_xy = σ_xy / (σ_x · σ_y), where σ_xy is the population covariance, and σ_x, σ_y are
the population standard deviations.
When the linear relationship between the variables is not perfect (the points of the
scatter diagram are not all on a straight line), the value of the correlation
coefficient lies strictly between −1 and +1.
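The covariance and correlation formulas can be computed directly; the perfectly linear illustrative data below (y = 2x) produce a correlation of exactly 1:

```python
import math

def sample_covariance(x, y):
    """s_xy = sum((x_i - xbar)(y_i - ybar)) / (n - 1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """r_xy = s_xy / (s_x * s_y): unit-free, between -1 and +1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sample_covariance(x, y) / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # y = 2x: a perfect positive linear relationship
```

Rescaling y (say, to other units) changes the covariance but leaves the correlation unchanged, which is exactly the drawback the correlation coefficient removes.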
We can measure the center of the probability distribution of a random variable with
its expected value, or mean:
E(X) = M(X) = Σ x·P(x).
The mean of the random variable describes the typical outcome (value) of the random
variable. To study the variation of the outcomes, we can use the variance and the standard
deviation of the random variable.
σ² = Σ (x − M(X))² · P(x), and the standard deviation is σ = √σ².
* Binomial Distribution
There are many experiments for which the results can be reduced to only two outcomes:
success and failure. Such experiments are called binomial experiments, and they have
the following properties.
1. The experiment consists of a sequence of n identical trials, and the trials are
independent.
2. There are only two possible outcomes in each trial: success and failure.
3. The probability of success, denoted by p, does not change from trial to trial.
Consequently, the probability of failure, 1 − p, denoted by q, also does not change
from trial to trial.
Suppose a binomial experiment consists of n trials. Let us denote by X the number of
successes occurring in the n trials. Then X is a discrete random variable, and the
distribution associated with this random variable is called the binomial probability
distribution. Its probability function is
P(x) = C(n, x) · p^x · (1 − p)^(n−x), where C(n, x) = n! / (x!(n − x)!) is the number
of ways to obtain x successes in n trials, and p is the probability of a success in
any trial.
There exist special tables developed to provide the probability of x successes in n
trials, with the given probability p of a success in one trial. They are called tables of
binomial probabilities, and are usually given in the books of probability and statistics.
To find the mean, variance and standard deviation for a discrete random variable, we
can use the formulas of the respective definitions. These formulas can be simplified
when the random variable has a binomial distribution.
Expected Value: E(X) = M(X) = np.
Variance: σ² = npq.
Standard Deviation: σ = √(npq).
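The binomial probability function and the simplified formulas for its mean and variance can be checked with a short sketch (the coin-toss numbers are an illustration of our own):

```python
import math

def binomial_pmf(x, n, p):
    """P(x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.5
mean = n * p                  # E(X) = np
variance = n * p * (1 - p)    # sigma^2 = npq
# P(exactly 5 successes in 10 trials with p = 0.5), e.g. 5 heads in 10 tosses
prob_5 = binomial_pmf(5, n, p)
```

Summing P(x) over x = 0, …, n gives 1, as any discrete probability function must.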
* Poisson Distribution
Often we are interested in the number of occurrences of an event over a specified
interval of time or space. For instance, the random variable X might be the number of
arrivals at a car wash in one hour. In such cases, the random variable has a Poisson
probability distribution if the following conditions are satisfied.
Poisson Experiment
1. The probability of occurrence is the same for any two intervals of the same length.
2. The occurrence or nonoccurrence in any interval is independent of the occurrence or
nonoccurrence in any other interval.
The probability of x occurrences in an interval is given by the Poisson
probability function:
P(x) = μ^x · e^(−μ) / x!,
where μ is the expected value or the mean number of the occurrences per interval, and
e ≈ 2.71828. Note that the discrete random variable may assume an infinite number of
values (x = 0, 1, 2, 3, ...).
The Poisson probability distribution can be used to approximate the binomial
probability distribution when the probability p of a success in one trial is small and
the number n of trials is big. The approximation is good whenever p ≤ 0.05 and
n ≥ 20. In such case put μ = np.
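The quality of the approximation can be seen numerically. The sketch below compares the exact binomial probability with the Poisson value for illustrative parameters n = 100, p = 0.02 (so μ = np = 2):

```python
import math

def poisson_pmf(x, mu):
    """P(x) = mu**x * e**(-mu) / x!"""
    return mu**x * math.exp(-mu) / math.factorial(x)

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

# Small p, large n: put mu = n*p and approximate the binomial by a Poisson.
n, p = 100, 0.02
mu = n * p
exact = binomial_pmf(2, n, p)   # exact binomial probability of 2 successes
approx = poisson_pmf(2, mu)     # Poisson approximation with mu = 2
```

The two probabilities agree to about two decimal places, which is typical for p this small.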
The probability distribution of a discrete random variable is described by its
probability function, denoted P(x), which provides the probability for each value x of
the random variable. A discrete probability distribution must satisfy the following
conditions: 1. P(x) ≥ 0 for every value x; 2. ΣP(x) = 1.
One important advantage of defining a random variable and its probability distribution
is that once it is known, it is relatively easy to find the probability of diverse
events of interest.
So far we have used the probability function P(x) to describe the values of a random
variable. Recall that this function provides the probability of a concrete value a
discrete random variable can take. For a continuous random variable, the distribution
is described instead by a probability density function f(x), and probabilities are
given by areas under its graph. A probability density function must satisfy two
conditions: 1. f(x) ≥ 0 for all x; 2. the total area under the graph of f(x) equals 1.
The first requirement is the analog of the requirement that P(x) ≥ 0 for the
discrete probability functions. The second condition is the analog of the condition
that the sum of the probabilities must equal 1 for a discrete probability function.
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)),
where μ is the mean, and σ is the standard deviation for the probability distribution
of X.
1. The graph of the normal distribution, called normal curve, is bell-shaped and
is symmetric about the mean. Its points of inflection are one standard deviation from
the mean.
2. The highest point on the normal curve is at the mean, which is also the
median and mode of the distribution.
3. The mean of the distribution can be any numerical value: negative, zero, or
positive.
4. The standard deviation determines the width of the curve. Larger values
result in wider curves, showing more variability in the data.
5. The tails of the curve extend to infinity in both directions and never touch the
horizontal axis.
6. Probabilities for some commonly used intervals are: (a) 68.26% of the values
of the random variable are within plus or minus one standard deviation of its mean;
(b) 95.44% of the values are within plus or minus two standard deviations of the
mean; (c) 99.72% of the values are within plus or minus three standard deviations of
the mean. See the Figure.
To find the probability that a normal random variable takes values in a certain
segment, we must compute the area under the normal curve over that interval.
Probabilities for all normal distributions are computed by using the standard normal
distribution.
We can convert any normal random variable X to the standard normal random variable Z
by the following formula:
z = (x − μ) / σ.
This is the same formula as that for computing z-scores for a data set. Thus, we can
interpret Z as the number of standard deviations that the normal random variable X is
from its mean μ.
For the standard normal probability distribution, areas under the normal curve
have been computed and are available in special tables.
To compute probabilities for any normal random variable with mean μ and standard
deviation σ, we first convert its values to z-scores and then use the table of areas
for the standard normal distribution.
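In code, the printed table of standard normal areas can be replaced by the error function from Python's `math` module, since Φ(z) = (1 + erf(z/√2))/2; the function names below are our own:

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z:
    Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return (1 + math.erf(z / math.sqrt(2))) / 2

def normal_prob(a, b, mu, sigma):
    """P(a <= X <= b) for X normal with mean mu and std. dev. sigma:
    convert the endpoints to z-scores, then subtract the areas."""
    za = (a - mu) / sigma
    zb = (b - mu) / sigma
    return standard_normal_cdf(zb) - standard_normal_cdf(za)

# About 68.26% of values fall within one standard deviation of the mean.
p_one_sd = normal_prob(-1, 1, 0, 1)
```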
* Sampling
We have defined earlier the two basic notions of statistics: a population and a
sample. Let us recall these definitions.
There are many methods of selecting a sample from a population. One of the
most common is simple random sampling. The definition and the process of selecting a
simple random sample depend on whether the population is finite or infinite.
A simple random sample of the size n from a finite population of the size
N is a sample selected such that each possible sample of the size n has the
same probability of being selected.
Simple random sampling is not the only method for selecting a sample from a population. Let us see
some alternative sampling methods.
Stratified Random Sampling. In stratified random sampling, the population is first divided into groups
called strata, such that each element in the population belongs to one and only one stratum, as, for instance, by
department, age, marital status, and so on. After the strata are formed, a simple random sample is selected
from each stratum. The results for strata samples are combined into one estimate of the population
characteristic of interest. If elements within strata are homogeneous, the strata will have low variance , and,
consequently, small sample sizes can be used to obtain good estimates of the strata characteristics.
Cluster Sampling. In this case the population is first divided into separate groups called clusters.
Each element of the population belongs to one and only one cluster. A simple random sample of the clusters is
then selected, and the elements of the sampled clusters form the sample. The best results are obtained when the
elements within a cluster are heterogeneous (not alike), because then each cluster is a small-scale version of
the entire population. Sampling a small number of clusters gives good estimates of the population parameters.
Cluster sampling is often applied in area sampling, where clusters are city blocks, schools, or other
defined areas. Many sample observations can be obtained from a cluster in a relatively short time.
* Point Estimation.
Let us return to the previous example and assume that a sample of 30 employees has been selected
and the corresponding data are recorded. To estimate the value of a population parameter, we compute the
value of a corresponding sample statistic. Thus, to estimate the population mean μ (population average
age) and the population standard deviation σ, we calculate the corresponding sample mean x̄ (sample
average age) and the sample standard deviation s. In addition, by computing the proportion of employees
in the sample who have completed the training program, we find the sample proportion p̄, which can be
used as an estimate of the population proportion p.
The statistical procedure in which we use the data from a sample to compute a value of a sample
statistic that serves as an estimate of a population parameter is called point estimation. Thus, x̄ is the
point estimator of the population mean μ, s is the point estimator of the population standard
deviation σ, and p̄ is the point estimator of the population proportion p. The actual numerical
values of x̄, s, and p̄ obtained for a particular sample are called point estimates of the respective
population parameters.
Usually, we select only one sample from the population of interest. If the sampling process is
repeated many times, then different samples generate a variety of values for the sample statistics x̄,
s, and p̄. It follows that each of these statistics is a random variable. Just like any random variable,
each of the sample statistics x̄, s, and p̄ has a mean or expected value, a variance, and a standard
deviation. The probability distribution of any sample statistic is called the sampling distribution of that
statistic.
* Sampling Distribution of x̄
The sample mean x̄ is commonly used to make inferences about the population mean μ.
Let us denote by E(x̄) the expected value and by σ_x̄ the standard deviation of the random variable x̄.
Then the following properties of the sample mean x̄ can be proved.
1. The expected value of the sample mean is equal to the population mean:
E(x̄) = μ.
2. If the size of the population is large in comparison to the sample size, that is, if n/N ≤ 0.05, then
σ_x̄ = σ / √n;  otherwise,  σ_x̄ = √((N − n)/(N − 1)) · (σ / √n).
The standard deviation of the sample mean is called the standard error of the mean.
Central Limit Theorem. This theorem describes the relationship between the sampling distribution of
the sample mean and the respective population distribution.
In selecting simple random samples of size n from a population with a mean μ and a standard
deviation σ, the sampling distribution of the sample mean x̄ can be approximated by a normal
probability distribution if n ≥ 30. The bigger the sample size, the better the approximation. If the
population itself has a normal probability distribution, then the sampling distribution of x̄ is normally
distributed for any sample size.
When the sample mean is used to estimate the population mean μ, we cannot expect the sample mean
to be exactly equal to the population mean. The absolute value of the difference between x̄ and μ,
|x̄ − μ|, is called the sampling error. The sampling distribution of x̄ can be used to provide
information about the size of the sampling error.
As the sample size increases, the standard error of the mean decreases. As a result, larger
sample sizes will provide higher probabilities that the sample mean is closer to the population mean.
* Sampling Distribution of p̄
In many problems in business and economics, we need the sample proportion p̄ to estimate the
population proportion p. The sample proportion is nothing other than a sample mean. Indeed, suppose a
simple random sample (x₁, x₂, ..., xₙ) is selected from the population, and xᵢ = 1 when the element
possesses the characteristic of interest, and xᵢ = 0 otherwise. Then the sample proportion is
p̄ = Σxᵢ / n, which is exactly the sample mean. The probability distribution of all possible values of
the sample proportion p̄ is called the sampling distribution of the sample proportion.
1. The expected value of the sample proportion is equal to the population proportion:
E(p̄) = p.
2. If the size of the population is large in comparison to the sample size, that is, if n/N ≤ 0.05, then
σ_p̄ = √( p(1 − p)/n );  otherwise,  σ_p̄ = √((N − n)/(N − 1)) · √( p(1 − p)/n ).
The sample proportion is a sample mean, which implies the following. The sampling distribution of p̄
can be approximated by a normal probability distribution whenever np ≥ 5 and n(1 − p) ≥ 5.
The absolute value of the difference between the value of the sample proportion p̄ and the value
of the population proportion p, |p̄ − p|, is called the sampling error. The information about the
size of the sampling error is provided by the sampling distribution of p̄.
As we know, the point estimators offer estimates for the values of the population parameters (mean,
proportion). However they do not provide information about the precision of this estimation. You will learn
further how interval estimation of population parameters can be used to deliver such information.
The interval [θ₁, θ₂] that covers the unknown population parameter with a probability of
c is called the confidence interval, and the probability c is called the confidence coefficient or the
confidence level.
In our course we will most often use the following confidence coefficients and the corresponding
z-scores: c = 90% (z_c = 1.645), c = 95% (z_c = 1.96), and c = 99% (z_c = 2.575).
If the area under the standard normal curve between −z_c and z_c is c (the confidence coefficient
is c), then the remaining area is 1 − c, and the probability that z falls within each tail is
(1 − c)/2.
For instance, if c = 90%, then 5% of the area lies to the left of −z_c = −1.645, and 5% lies to
the right of z_c = 1.645.
The margin of error is E = z_c · σ_x̄ = z_c · σ / √n.
In many sampling situations the value of the population standard deviation is unknown. If the sample
size is large (n ≥ 30), we simply use the statistic s to estimate the population standard deviation σ.
Knowing the sample mean and the margin of error, the confidence interval for the population mean is
now written as
x̄ − E ≤ μ ≤ x̄ + E.
Let us summarize the procedure of finding the confidence interval for the population mean (for the
large-sample case). Find the sample mean x̄. If the population standard deviation is known, use the
formula σ_x̄ = σ / √n; otherwise find the sample standard deviation
s = √( Σ(xᵢ − x̄)² / (n − 1) )
and use it as an estimate of σ. Then compute the margin of error and write the confidence interval
x̄ − E ≤ μ ≤ x̄ + E.
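The large-sample procedure can be sketched in Python; the z-score and data below are illustrative (z_c = 1.96 corresponds to a 95% confidence level):

```python
import math

def confidence_interval_mean(data, z_c):
    """Large-sample confidence interval for the population mean:
    xbar +/- z_c * s / sqrt(n), with s estimating the unknown sigma."""
    n = len(data)
    xbar = sum(data) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    e = z_c * s / math.sqrt(n)          # margin of error
    return xbar - e, xbar + e

# Illustrative sample of 36 values whose mean is 50.
data = [50 + ((i % 6) - 2.5) for i in range(36)]
low, high = confidence_interval_mean(data, 1.96)
```

The interval is centered at the sample mean and shrinks as n grows, since the standard error σ/√n decreases.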
One way to improve the precision of the estimate without decreasing the confidence coefficient is to
enlarge the sample size. For a certain confidence coefficient c and given margin of error E , the
minimum sample size is
n = ( z_c · σ / E )².
If σ is unknown, it can be estimated by the sample statistic s, provided that the sample size is
at least 30 elements.
In many practical situations the sample size is small (n < 30) and the population standard deviation
is unknown. How could you build the confidence interval for μ in this case?
If the population has a normal probability distribution, the sampling distribution of the sample mean
x is normal regardless the sample size. To build the confidence interval for , sample standard
deviation s will be used to estimate the unknown population standard deviation , and a probability
distribution known as the t-distribution will be applied instead of standard normal probability distribution.
The quantity
t = (x̄ − μ) / (s / √n)
is said to have a t-distribution if X is a normally distributed random variable. Here μ is the
expected value of X, and n, x̄, and s are, respectively, the sample size, the sample mean, and the
sample standard deviation.
1. The t-distribution is a family of curves, each bell-shaped and symmetric about the mean.
2. Each curve of the family is determined by a parameter called the degrees of freedom (for the
statistic above, d.f. = n − 1).
3. The total area under each t-curve equals 1.
4. The mean, the median, and the mode of the t-distribution are 0.
5. As the degrees of freedom increase, the t-distribution approaches the standard normal
distribution.
We will use the notation t_{α/2} to indicate the t-value with an area of α/2 in the upper tail of
the t-distribution.
The T-Distribution Table lists the values of t_{α/2} for the given confidence coefficients and
degrees of freedom.
Now let us see how the t-distribution is used to build a confidence interval for the population
mean. The procedure is similar to constructing a confidence interval for the normal distribution: in
both we need a sample mean x̄ and a margin of error E. Let us summarize it in the following steps.
1. Find the sample statistics n, x̄, and s.
2. Determine the degrees of freedom d.f. = n − 1, the confidence coefficient c, and the respective
t_{α/2} value.
3. Find the margin of error E = t_{α/2} · s / √n.
4. Write the confidence interval x̄ − E ≤ μ ≤ x̄ + E.
The same general scheme applies to a confidence interval for a population proportion as to a
confidence interval for a population mean. The following steps are to be made.
1. Find the sample proportion p̄ = x/n.
2. Verify that the sampling distribution of p̄ can be approximated by a normal distribution
(np̄ ≥ 5 and n(1 − p̄) ≥ 5).
3. Find the critical value z_{α/2} = z_c that corresponds to the given confidence coefficient c.
4. Find the margin of error E = z_c · √( p̄(1 − p̄)/n ), and write the confidence interval
p̄ − E ≤ p ≤ p̄ + E.
As we have seen earlier, one way to increase the precision of a confidence interval without
diminishing the confidence coefficient is to increase the sample size. Given a confidence coefficient of c
and a margin of error E , the minimum sample size n needed to estimate the population proportion
p is
n = p̄(1 − p̄) · ( z_c / E )²,
provided a preliminary estimate p̄ is known. If not, use p̄ = 0.5.
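The sample-size formula can be sketched as follows; the 95% confidence level (z_c = 1.96) and 3-point margin of error are illustrative choices:

```python
import math

def min_sample_size_proportion(z_c, e, p_est=0.5):
    """Minimum n to estimate a population proportion within margin of
    error e at the confidence level corresponding to z_c. With no prior
    estimate of p, the conservative choice 0.5 maximizes p(1 - p)."""
    n = p_est * (1 - p_est) * (z_c / e) ** 2
    return math.ceil(n)    # round up to guarantee the margin of error

# 95% confidence (z_c = 1.96), margin of error 0.03 (3 percentage points):
n_needed = min_sample_size_proportion(1.96, 0.03)
```

Any prior estimate of p other than 0.5 yields a smaller required sample size, since p(1 − p) peaks at p = 0.5.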
24. Interval estimation of the population variance and standard deviation. Examples.
In many practical situations it is important to control the amount of variation in some characteristics
of the population. For instance, thousands of parts produced by an industry must vary little or not at all. To
control the variation σ², you may use its point estimate s². Consequently, for the standard deviation
σ the point estimate s can be used. To build a confidence interval for the variance and standard
deviation, the chi-square distribution is used.
If a simple random sample of size n is selected from a normal population, then the statistic
χ² = (n − 1)s² / σ²
has a chi-square distribution with n − 1 degrees of freedom.
The values of the χ² distribution are tabulated for various degrees of freedom. Each value in the
table represents the area under the chi-square curve to the right of the critical value. There are
two critical values for each level of confidence: χ²_L and χ²_R.
For the confidence level c, the areas to the right of χ²_R and χ²_L are, respectively,
(1 − c)/2 and (1 + c)/2. Thus, there is a 1 − (1 − c)/2 − (1 − c)/2 = c probability of obtaining a
value χ² such that
χ²_L ≤ (n − 1)s²/σ² ≤ χ²_R.
As χ² = (n − 1)s²/σ², we can derive from the relationship χ²_L ≤ (n − 1)s²/σ² the following
inequality: σ² ≤ (n − 1)s²/χ²_L. Similarly, from (n − 1)s²/σ² ≤ χ²_R the inequality
(n − 1)s²/χ²_R ≤ σ² can be obtained. As a result, we can write the following general expression
for the confidence intervals:
Confidence Interval for σ²:  (n − 1)s²/χ²_R ≤ σ² ≤ (n − 1)s²/χ²_L.
Taking square roots of the endpoints gives the confidence interval for the standard deviation σ.
The alternative hypothesis H_a is the complement of the null hypothesis, and it contains a
statement of strict inequality, such as <, >, or ≠. The conclusion that H_a is true can be made if
the sample data indicate that H_0 is false.
Generally, a hypothesis test about the values of the population mean μ may take one of the
following three forms:
H_0: μ ≤ μ_0    H_0: μ ≥ μ_0    H_0: μ = μ_0
H_a: μ > μ_0    H_a: μ < μ_0    H_a: μ ≠ μ_0
No matter which of the three forms of hypothesis test is used, you always begin by assuming that
the equality condition in the null hypothesis is true. After performing the hypothesis test, you
will take one of two decisions: reject H_0, or do not reject H_0.
Definition. The type I error is made if the null hypothesis is rejected when it is true. The type II
error is made if the null hypothesis is not rejected when it is false.
Thus, we have the following four results of a hypothesis testing.
                        State of Things
Decision                H_0 is true         H_0 is false
Do not reject H_0       Correct decision    Type II error
Reject H_0              Type I error        Correct decision
Although we cannot eliminate the possibility of errors while performing a hypothesis test, we can
indicate the probability of their occurrence. The common notations are: α = P(Type I error) and
β = P(Type II error).
Most applications of hypothesis testing control for the probability of making a Type I error and do
not always control for the probability of making a Type II error. That is why, in order to avoid the
risk of making a Type II error, the statement "Do not reject H_0" is used instead of "Accept H_0".
Definition. The maximum allowable probability of making a Type I error is called level of
significance for the test.
Common choices for the level of significance are α = 0.1, α = 0.05, and α = 0.01. The
lower the level of significance, the smaller the probability of rejecting a true null hypothesis.
After stating the null and alternative hypotheses, and specifying the level of significance, the next
step is selecting a random sample from the population and calculating the sample statistics. The statistic used
to estimate the parameter in the null hypothesis is called the test statistic. The following table shows the
population parameters and corresponding test statistics.
Population parameter    Sample statistic    Standardized test statistic
μ                       x̄                   z (n ≥ 30) or t (n < 30)
p                       p̄                   z
σ²                      s²                  χ²
One way to decide whether to reject the null hypothesis is to check if the standardized test statistic
falls within a rejection region of the sampling distribution.
Definition. A rejection region (or critical region) of the sampling distribution is the range of values
of the test statistic for which the null hypothesis is rejected. The value of test statistic that establishes the
boundary of the rejection region, is called the critical value.
The following is the procedure of conducting a hypothesis test for a population mean (large-sample
case).
1. State the null and alternative hypotheses H_0 and H_a.
2. Specify the level of significance α.
3. Determine the critical value (or values) z_0.
4. Sketch the graph of the normal curve and indicate the rejection region. If the rejection region is
placed in only lower tail (or upper tail) of the sampling distribution, then we say the test is a left-tailed (or
right-tailed) hypothesis test. If the rejection region is placed in both the lower and the upper tails of the
sampling distribution, then we say the test is a two-tailed hypothesis test.
5. Find the standardized test statistic (for the population mean it is called the z-statistic):
z = (x̄ − μ_0) / (σ/√n) if σ is known, or z = (x̄ − μ_0) / (s/√n) if σ is unknown.
6. If z falls within the rejection region, reject H_0; otherwise, do not reject H_0.
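The large-sample test can be sketched in a few lines; the data, hypothesized mean, and significance level below are illustrative:

```python
import math

def z_test_mean(data, mu0, sigma=None):
    """Large-sample z-statistic for H0: mu = mu0. If the population
    sigma is unknown, the sample standard deviation s is used instead."""
    n = len(data)
    xbar = sum(data) / n
    if sigma is None:
        sigma = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    return (xbar - mu0) / (sigma / math.sqrt(n))

# Two-tailed test at alpha = 0.05: reject H0 if |z| > 1.96.
data = [52] * 18 + [48] * 18            # 36 values, sample mean 50
z = z_test_mean(data, mu0=49, sigma=2)
reject = abs(z) > 1.96
```

For these numbers z = 3, which lies in the rejection region, so H_0: μ = 49 is rejected.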
In many situations the sample size is small (n < 30) and the population standard deviation σ
is unknown. If the population has a normal distribution, you can use the t-distribution to make
inferences about the population mean. The test statistic for the mean is
t = (x̄ − μ_0) / (s/√n) (called the t-statistic), where s is the sample standard deviation. This
statistic has a t-distribution with n − 1 degrees of freedom.
The procedure of conducting a hypothesis test for a population mean in the small-sample case is
as follows.
1. State the null and alternative hypotheses H_0 and H_a.
2. Specify the level of significance α.
3. Use the T-Distribution Table to determine the critical value t_0 (or values ±t_0) for n − 1
degrees of freedom.
4. Find the standardized test statistic
t = (x̄ − μ_0) / (s/√n).
5. If t falls within the rejection region, reject H_0; otherwise, do not reject H_0.
The three forms for a hypothesis test about a population proportion p are as follows.
H_0: p ≤ p_0    H_0: p ≥ p_0    H_0: p = p_0
H_a: p > p_0    H_a: p < p_0    H_a: p ≠ p_0
The standardized test statistic is
z = (p̄ − p_0) / σ_p̄,  where σ_p̄ = √( p_0(1 − p_0)/n ).
29. Hypothesis tests about means and proportions with two populations. Examples.
You will learn further how to test a claim about the difference between the same parameters from two
populations. For instance, you may want to conduct a hypothesis test to find whether there is any difference
between educational quality provided at two high schools, or test the difference between the proportions of
defective parts supplied by two factories.
The type of test to be used is determined by the sizes of the samples selected from the two
populations, as well as by the fact of dependence or independence of the respective samples.
Definition. Two samples are called independent if they are selected from two different populations
and are not related one to another. Two samples are called dependent or matched if each element of one
sample corresponds to an element of the other sample.
For instance, if you select randomly 100 graduates from university A and 90 graduates from
university B, and test their qualification level, you obtain two independent samples. But if you
randomly select 70 freshmen from a university and measure their qualification level, and then, after
3 years, test the same sample of students again, you have dependent (or matched) samples.
The three forms of hypotheses about the means of two populations are:
H_0: μ_1 = μ_2    H_0: μ_1 ≥ μ_2    H_0: μ_1 ≤ μ_2
H_a: μ_1 ≠ μ_2    H_a: μ_1 < μ_2    H_a: μ_1 > μ_2
Difference between the means of two populations. Independent samples. Let us consider the
hypothesis tests about the difference between the means of two populations for independent samples.
If each sample size is at least 30, then the sampling distribution of the difference of the sample
means, x̄_1 − x̄_2, can be approximated by a normal probability distribution with mean and standard
deviation as follows.
E(x̄_1 − x̄_2) = μ_1 − μ_2,   σ_{x̄1−x̄2} = √( σ_{x̄1}² + σ_{x̄2}² ) = √( σ_1²/n_1 + σ_2²/n_2 ).
The standardized test statistic is
z = ( (x̄_1 − x̄_2) − (μ_1 − μ_2) ) / √( σ_1²/n_1 + σ_2²/n_2 ).
In real life it is often impractical to collect samples of large size. To test the difference
between the means of two small independent samples we assume that both populations have normal
probability distributions. With this condition, the sampling distribution for the difference of
sample means x̄_1 − x̄_2 leads to the test statistic
t = ( (x̄_1 − x̄_2) − (μ_1 − μ_2) ) / s_{x̄1−x̄2},
where s_{x̄1−x̄2} is the standard deviation (standard error) of x̄_1 − x̄_2.
If the population variances are not equal, then the standard error is
s_{x̄1−x̄2} = √( s_1²/n_1 + s_2²/n_2 )   and   d.f. = min(n_1 − 1, n_2 − 1).
Difference between the means of two populations. Matched samples. To perform a two-sample
hypothesis test with dependent samples, we must use a different technique. First find the difference
d_i for each data pair. Then determine the mean of these differences: d̄ = Σd_i / n. If both
populations are normally distributed, the t-distribution can be used.
Let us denote by μ_d the mean of the difference values in the population and formulate the null
and alternative hypotheses:
H_0: μ_d = 0   or   H_0: μ_d ≥ 0   or   H_0: μ_d ≤ 0
H_a: μ_d ≠ 0        H_a: μ_d < 0        H_a: μ_d > 0.
The sample mean and sample standard deviation for the difference values are
d̄ = Σd_i / n   and   s_d = √( Σ(d_i − d̄)² / (n − 1) ).
To test the null hypothesis, we will use the t-statistic
t = (d̄ − μ_d) / (s_d / √n)   with n − 1 degrees of freedom.
Difference between the proportions of two populations. The difference between two population
proportions p_1 and p_2 can be tested using a sample proportion from each population. The following
three forms of hypotheses can be stated:
H_0: p_1 = p_2    H_0: p_1 ≥ p_2    H_0: p_1 ≤ p_2
H_a: p_1 ≠ p_2    H_a: p_1 < p_2    H_a: p_1 > p_2.
If the samples are randomly selected, independent, and large enough to use a normal sampling
distribution, then
E(p̄_1 − p̄_2) = p_1 − p_2   and   σ_{p̄1−p̄2} = √( p̄ q̄ (1/n_1 + 1/n_2) ),
where p̄ is the weighted mean of the sample proportions, that is,
p̄ = (n_1 p̄_1 + n_2 p̄_2) / (n_1 + n_2)   and   q̄ = 1 − p̄.
The standardized test statistic is
z = ( (p̄_1 − p̄_2) − (p_1 − p_2) ) / σ_{p̄1−p̄2}.
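The two-proportion test can be sketched directly from these formulas. The defective-parts counts below are an illustration of our own, in the spirit of the two-factory example mentioned earlier:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for H0: p1 = p2, using the pooled (weighted) sample
    proportion as the common estimate under the null hypothesis."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)   # weighted mean of the proportions
    q_pool = 1 - p_pool
    se = math.sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Defective parts: 30 of 200 from factory A, 12 of 200 from factory B.
z = two_proportion_z(30, 200, 12, 200)
# Two-tailed test at alpha = 0.05: reject H0: p1 = p2 if |z| > 1.96.
reject = abs(z) > 1.96
```

For these counts z ≈ 2.94, so the hypothesis of equal defect rates is rejected at the 5% level.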
30. Individual and aggregate price indexes. Examples.
Index numbers are used as a descriptive statistics tool for describing the evolution of an economic
variable over time. An index number represents a ratio between the value of the variable recorded in
one period, called the current period, and the value of the same variable recorded in an earlier
reference period, called the base period.
The National Bureau of Statistics regularly publishes a variety of indexes that can help users to
understand current business and economic situation. The most widely known and used is the Consumer Price
Index (CPI). The CPI measures changes in price over a period of time. Given a starting point or base period
with its associated index of 100, the CPI can be used to compare current period consumer prices with prices
in the base period. For example, a CPI of 110 shows that consumer prices increased approximately 10%
compared to the base period.
To compare prices in different years, we convert them to price relatives, or individual price indexes,
which express the unit price in each period as a percentage of the unit price in a base period.
$$ \text{Price relative in period } t=\frac{\text{Price in period } t}{\text{Base period price}}\times 100. $$
Knowing the price relative, you can easily compare the price in any one year with the price in the
base year. Price relatives are very helpful in understanding and interpreting economic changes over time.
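A price relative is a one-line computation; the function name and the prices below are invented for illustration.

```python
def price_relative(price_t, base_price):
    """Individual price index: the period-t price expressed as a
    percentage of the base-period price."""
    return price_t / base_price * 100
```

For instance, if an item cost 2.00 in the base year and 3.00 in period $t$, its price relative is 150, i.e. a 50% increase over the base period.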
Economists are often more interested in the general price change for a group of products and services taken as a whole. For example, if we are interested in the overall cost of living over time, we need an index that reflects the price changes for a variety of items, including food, housing, clothing, medical care, and so on. To measure the combined change of a group of items, a special aggregate price index is developed. A simple (unweighted) aggregate price index is obtained by summing the unit prices in period $t$ and dividing that sum by the sum of the unit prices of the base year:
$$ I_t=\frac{\sum P_{it}}{\sum P_{i0}}\times 100, $$

where $P_{it}$ is the unit price for item $i$ in period $t$, and $P_{i0}$ is the unit price for item $i$ in the base period.
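The simple aggregate index is a ratio of price sums; a minimal sketch with a hypothetical function name and invented prices:

```python
def simple_aggregate_index(prices_t, prices_0):
    """Unweighted aggregate price index: sum of period-t unit prices
    over sum of base-period unit prices, times 100."""
    return sum(prices_t) / sum(prices_0) * 100
```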
The value of the index is heavily influenced by the items having large per-unit prices. Because of
such sensitivity, the simple index is not widely used. Instead, a weighted aggregate price index is commonly
applied. In computing this index, each item in the group is weighted according to its importance, for instance,
its quantity of usage or quantity weights. The quantity of usage shows the expected annual usage for each
type of item.
If $Q_i$ denotes the quantity weight for item $i$, then the weighted aggregate price index is given by

$$ I_t=\frac{\sum P_{it}Q_i}{\sum P_{i0}Q_i}\times 100. $$
The weighted index, compared with the simple aggregate index, shows a more moderate increase in the expenses, since it takes into account the quantity of usage of the main products (bread and milk) and helps to offset the large increase in repair costs. The weighted aggregate index with quantities of usage as weights is the preferred method for determining a price index for a group of products and services.
When the quantities $Q_i$ are considered fixed and do not vary with time, they can be determined from the base period, and a fixed-weight index is used. In this case the weighted aggregate price index is computed according to the formula

$$ I_t=\frac{\sum P_{it}Q_{i0}}{\sum P_{i0}Q_{i0}}\times 100. $$

This index is called the Laspeyres index.
In the case when the quantity weights are revised and computed each year, the weighted aggregate price index is

$$ I_t=\frac{\sum P_{it}Q_{it}}{\sum P_{i0}Q_{it}}\times 100. $$
This weighted aggregate index is called the Paasche index. Although it has the advantage of reflecting current usage patterns, it requires obtaining new quantity data each period and recomputing the index numbers for the previous periods to reflect the effect of the current quantity weights. Because of these disadvantages, the Laspeyres index is more widely used in applications.
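The two weighting schemes differ only in which quantities appear in the sums. A minimal sketch, with hypothetical function names and invented prices and quantities:

```python
def laspeyres(p_t, p_0, q_0):
    """Weighted aggregate price index with base-period quantity weights."""
    num = sum(pt * q for pt, q in zip(p_t, q_0))
    den = sum(p0 * q for p0, q in zip(p_0, q_0))
    return num / den * 100

def paasche(p_t, p_0, q_t):
    """Weighted aggregate price index with current-period quantity weights."""
    num = sum(pt * q for pt, q in zip(p_t, q_t))
    den = sum(p0 * q for p0, q in zip(p_0, q_t))
    return num / den * 100
```

With base prices [1, 2], current prices [2, 3], base quantities [10, 5] and current quantities [8, 6], the Laspeyres index is 175 while the Paasche index is 170: the Paasche index reflects the shift in usage toward the item whose price rose less.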
The aggregate price index can be directly computed from the individual price indexes of each item. Indeed, for the Laspeyres index we have

$$ I_t=\frac{\sum P_{it}Q_{i0}}{\sum P_{i0}Q_{i0}}\times 100=\frac{\sum \dfrac{P_{it}}{P_{i0}}\,P_{i0}Q_{i0}}{\sum P_{i0}Q_{i0}}\times 100=\frac{\sum \dfrac{P_{it}}{P_{i0}}\,w_i}{\sum w_i}\times 100, $$

where $w_i=P_{i0}Q_{i0}$ is the weight applied to the individual price index for item $i$.
Similarly, for the Paasche index,

$$ I_t=\frac{\sum P_{it}Q_{it}}{\sum P_{i0}Q_{it}}\times 100=\frac{\sum \dfrac{P_{it}}{P_{i0}}\,P_{i0}Q_{it}}{\sum P_{i0}Q_{it}}\times 100=\frac{\sum \dfrac{P_{it}}{P_{i0}}\,w_i}{\sum w_i}\times 100, $$

where $w_i=P_{i0}Q_{it}$ is the weight applied to the individual price index for item $i$.
When dealing with time series data (data collected over several time periods) that involve money amounts, the interpretations can be very misleading if price changes over time are ignored.
Deflating a time series has an important application in calculating the Gross Domestic Product
(GDP). The GDP is the total value of all goods and services produced in a particular country. To adjust the
total value of goods and services produced so as to reflect the real changes in their volume produced and sold,
the GDP must be computed with a price index deflator. The procedure is similar to that illustrated in the
example with the wages calculations.
32. Components of a time series. Examples.
A forecast is a prediction of what will happen in the future. Suppose you are asked to forecast the
sales of a certain product in the coming year. To provide such a prediction, you will review the actual sales
data for the product in the previous periods to better understand the patterns of past sales. Historical sales
represent a time series.
Definition. A time series is a set of observations on a variable made at successive points of time or
over successive periods of time.
The objective of analyzing the time series is to provide good forecasts of future values of the time
series.
Forecasting methods are classified into quantitative and qualitative methods. Quantitative forecasting methods are used when (1) past information about the variable is available, (2) the data can be quantified, and (3) it can be assumed that the pattern of the past will continue in the future. If the forecast is based only on past values of the time series, the procedure is called a time series method; if it also uses other related variables, it is called a causal method. We will consider here three of the time series methods: smoothing, trend projection, and trend projection adjusted for seasonal influence.
Qualitative methods generally use expert judgement to provide forecasts. For instance, a group of experts comes to a consensus regarding the prime rate for the next year. Qualitative methods are applied when the information used cannot be quantified and when historical data are unavailable.
Usually, four main components are distinguished in the behavior of the data in a time series: the trend, cyclical, seasonal, and irregular components.
Trend Component. The time series may show gradual shifts (or movements) to higher or lower values over a long period of time. This shifting is usually the result of long-term factors such as demographic changes, technology development, consumer preferences, and so on. The gradual shifting of the time series is called the trend in the time series.
There can be other possible time series trend patterns (linear decreasing, nonlinear, no trend, and so on).
Cyclical Component. Usually, the values of the time series do not fall exactly on the trend line, and often show alternating sequences of points below and above the trend line. Any repeated increases and decreases about the trend line lasting more than one year can be attributed to the cyclical component of the time series (see Figure). This component of the time series is due to cyclical movements in the economy, such as moderate inflation followed by rapid inflation.
Seasonal Component. The trend and cyclical components can be observed when studying historical data over multiannual periods. However, many time series show a regular pattern within a one-year period. For instance, peak sales of snow equipment are expected during the winter months and low sales during the summer months. The variability in data due to seasonal influences is thus determined by the seasonal component of the time series. Generally, the seasonal component can be used to represent any regularly repeating pattern within a one-year period. For example, daily sales volume in a small market is expected to be higher in the evening and lower during the day.
Irregular Component. The irregular component of the time series is responsible for deviations of the actual values from those expected according to the effects of the trend, cyclical, and seasonal components. It is caused by unanticipated factors and, hence, is unpredictable.
We will consider here three forecasting methods: moving averages, weighted moving averages, and
exponential smoothing. They are called smoothing methods, since the objective of these methods is
smoothing out the effects of the irregular component of the time series. These methods are appropriate for a
stable time series, when there are no significant trend, cyclical and seasonal changes. They provide a high
level of precision for short-range forecasts, for instance a forecast for the next time period.
Moving Averages. This method uses the average of the most recent $n$ data values in the time series to make a forecast for the next period. The computing formula is

$$ F_{t+1}=\frac{X_t+X_{t-1}+\dots+X_{t-n+1}}{n}, $$

where $F_{t+1}$ is the forecast for period $t+1$ and $X_t$ is the actual value of the time series in period $t$.
One of the often-used measures of the accuracy of a forecasting method is the mean squared error (MSE), which is the average of the squared forecast errors.
Obviously, for a time series, moving averages of different lengths may provide different forecasting
accuracy. You may use trial and error to determine the length that minimizes the MSE for the past values in
the time series, and apply it for the next period.
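The trial-and-error search over the window length can be sketched as below. The function names are hypothetical and the series is invented; the last forecast in the list is the forecast for the next, unobserved period.

```python
def moving_average_forecasts(series, n):
    """Forecast each period after the first n using the average of the
    previous n values; the last entry is the next-period forecast."""
    return [sum(series[i - n:i]) / n for i in range(n, len(series) + 1)]

def mse(series, n):
    """Mean squared error of the n-period moving-average forecasts
    over the observed part of the series."""
    forecasts = moving_average_forecasts(series, n)[:-1]  # drop next-period forecast
    actuals = series[n:]
    errors = [a - f for a, f in zip(actuals, forecasts)]
    return sum(e * e for e in errors) / len(errors)
```

To choose the length, one would compute `mse(series, n)` for several values of `n` and keep the one with the smallest error.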
Weighted Moving Averages. One variation of the moving average method involves weighted moving averages, in which a different weight is selected for each data value and a weighted average of the most recent $n$ values is computed as the forecast. Usually, the most recent data value is given the biggest weight, and the weight decreases for older values. The sum of the weights is equal to 1. This is a requirement to be respected
in selecting the weights. If we believe that the recent past value is better for prediction than the more distant
values, then larger weights should be given to the more recent observations. However, when the time series is
too variable, then approximately equal weights are to be given to all data values. To measure the forecast
accuracy, we can use MSE, choosing the combination of weights so as to minimize the mean squared error.
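A minimal sketch of one weighted-average forecast, assuming the weights are listed oldest-first and sum to 1 (function name and data invented):

```python
def weighted_moving_average(series, weights):
    """Forecast using the most recent len(weights) values; weights are
    listed oldest first, with the largest weight typically on the
    newest value, and must sum to 1."""
    assert abs(sum(weights) - 1) < 1e-9
    recent = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, recent))
```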
Exponential Smoothing. This method has minimal data requirements, and is good to use when forecasting large numbers of items. Exponential smoothing is a special case of the weighted moving averages method in which only one weight is selected: the weight for the most recent data value. The forecasting formula is as follows:

$$ F_{t+1}=\alpha X_t+(1-\alpha)F_t, $$

where
$F_t$ is the forecast for period $t$,
$X_t$ is the actual value of the time series in period $t$,
$\alpha$ is the smoothing constant ($0\leq\alpha\leq 1$).
As the formula shows, the forecast for period $t+1$ is a weighted average of the actual value in period $t$ and the forecasted value for period $t$. It can be proved that the exponential smoothing forecast for any period $t$ is a weighted average of all the previous actual values $X_1, X_2, \dots, X_{t-1}$.
The formula for the exponential smoothing calculation can be rewritten as follows:

$$ F_{t+1}=F_t+\alpha\,(X_t-F_t). $$
It shows that the forecast in period $t+1$ is obtained by adjusting the forecast in period $t$ by a fraction $\alpha$ of the forecast error $X_t-F_t$.
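The recursive form translates directly into code. This sketch assumes the common convention $F_1=X_1$ for starting the recursion; the function name is hypothetical.

```python
def exponential_smoothing(series, alpha, f1=None):
    """Return the forecasts F_2, ..., F_{n+1}; the initial forecast F_1
    is conventionally set to the first actual value X_1."""
    f = series[0] if f1 is None else f1
    forecasts = []
    for x in series:
        f = f + alpha * (x - f)   # F_{t+1} = F_t + alpha * (X_t - F_t)
        forecasts.append(f)
    return forecasts
```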
The trend projection method is applicable to time series that have a trend component.
To identify a linear trend, simple linear regression can be applied. Recall that the least squares method is used to determine the best straight-line relationship between two variables. Thus, the equation of the linear trend is

$$ Y_t=a\,t+b, $$

where $Y_t$ is the trend value of the time series in period $t$, and $t$ is time.
For the time series on sport shoes sales, $t=1$ corresponds to the first year period, $t=2$ corresponds to the second year period, and so on. The formulas for computing the estimated regression coefficients $a$ and $b$ are
$$ a=\frac{\sum tX_t-n\,\bar{t}\,\bar{X}}{\sum t^2-n\,\bar{t}^2}, \qquad b=\bar{X}-a\,\bar{t}, $$

where
$X_t$ is the value of the time series in period $t$,
$\bar{X}$ is the average value of the time series, that is $\bar{X}=\frac{\sum X_t}{n}$,
$\bar{t}$ is the average value of $t$, that is $\bar{t}=\frac{\sum t}{n}$,
$n$ is the number of periods.
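The coefficient formulas above can be sketched directly (hypothetical function name, invented series):

```python
def linear_trend(series):
    """Least-squares trend Y_t = a*t + b with periods t = 1, 2, ..., n."""
    n = len(series)
    t_vals = range(1, n + 1)
    t_bar = sum(t_vals) / n
    x_bar = sum(series) / n
    num = sum(t * x for t, x in zip(t_vals, series)) - n * t_bar * x_bar
    den = sum(t * t for t in t_vals) - n * t_bar ** 2
    a = num / den            # slope
    b = x_bar - a * t_bar    # intercept
    return a, b
```

The trend projection for a future period $t$ is then simply `a * t + b`.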
Regression analysis can also be applied to model curvilinear trends in time series.
Trend projection adjusted for seasonal influence. Let us see how to forecast a time series that has
both trend and seasonal components. In many situations economists are interested in period-to-period
comparisons of time-series values. In such cases seasonal effects must be taken into account in order not to draw wrong conclusions about the overall trend in a time series. For example, electric power consumption may decrease in the summer months, whereas yearly use of electric power may be increasing.
Removing the seasonal effect from a time series is called deseasonalizing the time series. Thus, the
first step is to compute the seasonal indexes and use them to deseasonalize the data. After that, if the trend
exists, it can be estimated using the regression analysis.
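Assuming the seasonal indexes have already been computed, the deseasonalizing step itself is a simple division; the function name, the convention that indexes average to 1, and the data are all illustrative assumptions.

```python
def deseasonalize(series, seasonal_indexes):
    """Remove the seasonal effect by dividing each value by its seasonal
    index (an index of 1.10 means 10% above the seasonal average);
    the indexes repeat with the length of the season."""
    length = len(seasonal_indexes)
    return [x / seasonal_indexes[i % length] for i, x in enumerate(series)]
```

The trend can then be estimated on the deseasonalized values, for example with the least squares formulas of the trend projection method.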