Chapter 8 and 9
A summary measure that describes any given characteristic of the population is known as a
Parameter. Eg: population mean (µ), population variance (σ²), population standard deviation (σ),
population proportion (P), population moments (µr) are parameters.
The summary measure that describes the characteristic of the sample is known as a Statistic.
Eg: sample mean (X̄), sample variance (S²), sample standard deviation (S), sample proportion
(p), sample moments (X̄r) are Statistics.
Statistical inference generally takes one of the two forms, namely, estimation of the population
parameter and testing of hypothesis.
For the purpose of general discussion, a population parameter is denoted by θ and the
corresponding statistic by θ̂. As already stated, the parameter θ is unknown. The value of the
statistic θ̂ is computed from the random sample taken from the population.
The statistic θ̂ intended for estimating a parameter θ is called an Estimator of θ. The specific
numerical value of an estimator calculated from the sample is called the Estimate.
The process of obtaining an estimate of the unknown value of a parameter by a statistic is called
Estimation. There are two types of estimation: point estimation and interval estimation.
Point Estimation
It is the process of obtaining a single sample value (point estimate) that is used to estimate the
desired population parameter. The estimator is known as a point estimator.
Eg: X̄ is a point estimator of µ.
S is a point estimator of σ.
The best estimator should be highly reliable and have such desirable properties as unbiasedness,
consistency, efficiency and sufficiency. These criteria are described as follows:
1. Unbiasedness: An estimator is a random variable since it is always a function of the
sample values. A sample statistic is considered to be an unbiased estimator if its
expected value equals the population parameter which is being estimated. This means
E(θ̂) = θ.
2. Consistency: It refers to the effect of sample size on the accuracy of the estimator. A
statistic is said to be a consistent estimator of the population parameter if it approaches the
parameter as the sample size increases, i.e. θ̂ → θ as n → N.
3. Efficiency: An estimator is considered to be efficient if its value remains stable from
sample to sample. The best estimator would be the one which would have the least
variance from sample to sample. From the three point estimators of central tendency,
namely the mean, median and mode, the mean is considered the least variant and hence a
better estimator.
4. Sufficiency: An estimator is said to be sufficient if it uses all the information about the
population parameter contained in the sample. For example, the statistic mean uses all the
sample values in its computation while median and mode do not. Hence the mean is a
better estimator in this sense.
Interval Estimation
A point estimator has some drawbacks. First, a point estimate from the sample may not exactly
locate the population parameter (i.e. the value of the point estimate is not likely to be exactly
equal to the value of the parameter), resulting in some margin of uncertainty. If the sample value
is different from the population value, the point estimate does not indicate the extent of the
possible error. Secondly, a point estimate does not specify how confident we can be that the
estimate is close to the parameter it is estimating. That is, we cannot attach any degree of
confidence to such an estimate as to how close it is to the value of the parameter.
Because of these limitations of point estimation, interval estimation is considered desirable.
Interval estimation involves the determination of an interval (a range of values) within which the
population parameter is expected to lie with a specified degree of confidence. It is the
construction of an interval on both sides of the point estimate within which we can be reasonably
confident that the true parameter will lie.
Ex: - Haramaya University wishes to estimate the average age of students who graduate with
B.Sc. degree. A random sample of 625 graduating students showed that the average age was 24
with a standard deviation of 5 years. Construct the 95% confidence interval for the true average
age of all such graduating students at the university and interpret it.
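One way to work this example is sketched below. It assumes that, with n = 625, the sample standard deviation can stand in for the population value and that the normal (z) interval X̄ ± z·S/√n applies. The function name z_confidence_interval is illustrative and not part of these notes.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, std_dev, n, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval for the mean."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Margin of error: critical value times the standard error of the mean.
    margin = z * std_dev / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Figures from the example: mean age 24, standard deviation 5, n = 625.
lower, upper = z_confidence_interval(24, 5, 625)
print(f"{lower:.2f} to {upper:.2f}")
```

Under these assumptions the interval is roughly 23.61 to 24.39 years, i.e. we can be 95% confident that the true average age of graduating students lies in this range.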
Hypothesis Testing
A statistical hypothesis is a conjecture (an assumption) about a population parameter which may
or may not be true. Hypothesis testing is a statistical procedure that leads to a decision
about whether such an assumption about the population parameter is correct, using data
obtained from the sample.
In hypothesis testing, the researcher must define the population under study, state the particular
hypothesis that will be checked, give the significance level, select a sample from the population,
perform the calculations required for the statistical test and reach a conclusion.
As already stated, a statistical hypothesis may or may not be true. For each situation, there
are two types of statistical hypotheses.
1. Null Hypothesis (H0): a statistical hypothesis that states there is no difference
between a parameter and a specific or hypothesized value. H0: µ = µ0, where µ is the
population mean and µ0 is the hypothesized mean.
2. Alternative Hypothesis (H1): a statistical hypothesis that states there exists a
difference between a parameter and a specific or hypothesized value. It takes one of the forms
H1: µ ≠ µ0 (two-tailed), H1: µ < µ0 (left-tailed), H1: µ > µ0 (right-tailed).
Errors in Hypothesis Testing
1. Type I error: the error committed when one rejects a null hypothesis which is actually
true.
2. Type II error: the error committed when one fails to reject a null hypothesis which is
actually false.
The maximum probability of committing type I error is called the level of significance and
denoted by α (alpha).
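The meaning of α can be checked with a small simulation. The sketch below assumes a standard normal population (µ = 0, σ = 1), a two-tailed z-test and 2000 repeated samples of size 30. All of these settings, and the function name, are illustrative choices rather than part of these notes.

```python
import random
from statistics import NormalDist, mean


def type_i_error_rate(alpha=0.05, n=30, trials=2000, seed=1):
    """Proportion of samples drawn under a true H0 that a two-tailed z-test rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # H0: mu = 0 is true by construction, with known sigma = 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (mean(sample) - 0.0) / (1.0 / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1  # rejecting a true H0 is a Type I error
    return rejections / trials


rate = type_i_error_rate()
print(rate)
```

Because H0 is true in every simulated sample, each rejection is a Type I error, and the printed proportion should be close to α = 0.05.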
4. Define the critical (rejection) region.
5. If the value of the test statistic falls in the critical region (rejection region), reject the null
hypothesis; otherwise do not reject it.
6. Make a decision.
EX:
1. A researcher reports that the average salary of veterinarians is more than $42000. A sample
of 30 veterinarians has a mean salary of $43260. Test the report's claim. Assume the
population standard deviation is $5230.
2. A national magazine claims that the average college student watches less television than
the general public. The national average is 29.4 hours per week, with a standard deviation
of 2 hours. A random sample of 25 college students has a mean of 27 hours. Test the claim.
Assume normality.
3. A merchant believes that the average age of customers who purchase a certain brand of
wears is 13 years. A random sample of 35 customers had an average age of 15.6
years. At α=0.01, should this conjecture be rejected? The standard deviation of the
population is 1 year.
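Examples of this kind can be worked with a z-test, since the population standard deviation is given. The sketch below is one possible helper (the name z_test and its arguments are illustrative, not from these notes). It is applied to example 3, the only one that states α, taking H1: µ ≠ 13 as the alternative.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha, tail):
    """One-sample z-test; tail is 'two', 'left' or 'right'.

    Returns (z statistic, critical value, whether H0 is rejected).
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "two":
        critical = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) > critical
    elif tail == "right":
        critical = std_normal.inv_cdf(1 - alpha)
        reject = z > critical
    elif tail == "left":
        critical = std_normal.inv_cdf(alpha)
        reject = z < critical
    else:
        raise ValueError("tail must be 'two', 'left' or 'right'")
    return z, critical, reject


# Example 3: H0: mu = 13 against H1: mu != 13 at alpha = 0.01.
z, critical, reject = z_test(15.6, 13, 1, 35, alpha=0.01, tail="two")
print(f"z = {z:.2f}, critical value = {critical:.3f}, reject H0: {reject}")
```

Here z is about 15.38, far beyond the critical value of about 2.576, so the merchant's conjecture is rejected at α = 0.01. Examples 1 and 2 can be handled the same way with tail="right" and tail="left" once a significance level is chosen.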
Chapter 9
Simple Linear Regression and Correlation
In the previous chapters we have been dealing with a single variable. In this chapter we will deal
with bivariate data, i.e. data involving two variables. In particular, we will deal with the
problem of predicting the average value of one variable in terms of known values of the other
variable(s).
Regression may be defined as the estimation or prediction of the unknown value of one variable
from the known values of one or more variables. The variable whose values are to be estimated
or predicted is known as the dependent or explained variable, while the variables which are used in
determining the value of the dependent variable are called independent or predictor variables.
The regression study that involves only two variables is called simple regression and the
regression analysis that studies more than two variables is called multiple regression. If the
relationship between the two variables can be described by a straight line then the regression is
known as linear regression; otherwise it is called non-linear.
The regression analysis involving only two variables and having a linear relationship is
called Simple Linear Regression. This linear relationship between the two variables is
represented by a straight line.
Regression Line (Line of Regression): is the line that gives the best estimate of one variable for
any given value of another variable. The regression line which is used to predict the values of Y
for any given value of X is called the regression line of Y on X. Similarly, the regression line which is
used to predict the values of X for any given value of Y is called the regression line of X on Y.
Regression Equation: is a mathematical equation that defines the relationship between two
variables.
Regression of Y on X
Model: Y = α + βX + ε
Where Y is the dependent variable
X is the independent variable
α is the constant term (intercept)
β is the slope (change in Y for a unit change in X)
ε is the error term
To estimate the regression coefficients (α and β), the procedure is to minimize the sum of the
squares of the errors (the method of least squares). Let the estimated model be Ŷ = a + bX. Then,
from sample data the values of a (estimate of α) and b (estimate of β) can be obtained as follows:
b = (n∑XY − ∑X∑Y) / (n∑X² − (∑X)²) and a = Ȳ − bX̄.
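The formulas for a and b translate directly into code. The sketch below uses a small made-up data set (not taken from these notes) and a hypothetical helper named fit_line.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line Y = a + bX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # b = (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - (sum(X))^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # a = mean(Y) - b * mean(X)
    a = sum_y / n - b * sum_x / n
    return a, b


# Illustrative values only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"Estimated line: Y = {a:.2f} + {b:.2f}X")
```

For these illustrative values the fitted line is Ŷ = 2.2 + 0.6X.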
Correlation
Most variables in economics and business are related to one another. For example, price and
supply, income and expenditure, advertising expenditure and sales. Thus, in order to know the
degree or direction of such a relationship between variables, correlation analysis is important.
Correlation is a mathematical tool designed for measuring the degree of the relationship
(degree of association) between the variables. Correlation that involves only two variables is
called simple correlation and correlation that involves more than two variables is called multiple
correlation.
Covariance is a measure of the joint variation in two variables, i.e. it measures the way in which
the values of the two variables vary together. If the covariance is zero, there is no linear
relationship between the two variables. If it is negative, there is an inverse (indirect) linear
relationship between them. If the covariance is positive, there is a direct linear relationship
between the variables.
Pearson’s coefficient of correlation (r)
Pearson’s coefficient of correlation (r) is used to measure the strength of the linear relationship
between two variables.
The population correlation coefficient is denoted by ρ and the sample correlation coefficient is
denoted by r.
r = (n∑XY − ∑X∑Y) / [√(n∑X² − (∑X)²) · √(n∑Y² − (∑Y)²)]
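The computation of r can be sketched as follows. The data are made up for illustration (not from these notes) and the helper name pearson_r is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Illustrative values only.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"r = {r:.4f}")
```

For these illustrative values r is about 0.77, which points to a positive linear relationship of moderate to strong degree.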
Interpretation of r
If the value of r is -1 or 1, there is a perfect negative or perfect positive linear relationship
between the variables.
If the value of r is close to -1 or 1, there is a strong negative or strong positive
linear relationship between the variables.
If r is about -0.5 or about 0.5, there is a moderate negative or moderate positive
linear relationship between the variables.
If r = 0, there is no linear relationship.
Ex:
1. Given the following data on supply (X) and sales (Y) of a certain commodity
Supply (X)   60   62   65   70   73   75   71
Sales (Y)    10   11   13   15   16   19   14
2. The following summary results are obtained from price and demand of a commodity
∑price = 30, ∑demand = 40, ∑(price)(demand) = 214,
∑(price)² = 220, ∑(demand)² = 340, n = 5
a. Identify the dependent and independent variable.
b. Estimate the regression equation.
c. Interpret the estimated coefficients.
d. Calculate the correlation coefficient between price and demand, and interpret it.
e. Find the coefficient of determination and interpret it.
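A sketch of parts (b), (d) and (e) follows. It assumes that price is the independent variable X and demand is the dependent variable Y (the usual reading, but an assumption here). The variable names are illustrative.

```python
from math import sqrt

# Summary results given in the example (price taken as X, demand as Y).
n = 5
sum_x, sum_y = 30, 40
sum_xy = 214
sum_x2, sum_y2 = 220, 340

# Building blocks shared by the slope and correlation formulas.
sxy = n * sum_xy - sum_x * sum_y   # n*sum(XY) - sum(X)*sum(Y)
sxx = n * sum_x2 - sum_x ** 2      # n*sum(X^2) - (sum(X))^2
syy = n * sum_y2 - sum_y ** 2      # n*sum(Y^2) - (sum(Y))^2

b = sxy / sxx                      # slope
a = sum_y / n - b * sum_x / n      # intercept: mean(Y) - b*mean(X)
r = sxy / (sqrt(sxx) * sqrt(syy))  # Pearson's r
r_squared = r ** 2                 # coefficient of determination

print(f"b = {b:.2f}, a = {a:.2f}, r = {r:.4f}, r^2 = {r_squared:.3f}")
```

Under that assumption the estimated line is demand = 11.9 − 0.65·price, r is about −0.92 and r² is 0.845, so about 84.5% of the variation in demand is explained by price.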
3. Given n = 25, X̄ = 3.95, Ȳ = 2.03, Sx² = 85.35, Sy² = 98.75, Sxy = 90
a. Fit the regression equation Y on X.
b. Interpret the estimated coefficients.
c. Calculate the correlation coefficient and interpret it.
d. Find the coefficient of determination and interpret it.