Processing & Analysis of Data
There are a number of items that belong in this portion of statistics, such as:
1) A confidence interval gives a range of values for an unknown parameter of the population by
measuring a statistical sample. This is expressed in terms of an interval and the degree of
confidence that the parameter is within the interval.
Methods of Parameter Estimation:
a) Probability Plotting: A method of finding parameter values where the data is plotted on
special plotting paper and parameters are derived from the visual plot
b) Least Squares Method: A method of finding parameter values that minimizes the sum of
the squares of the residuals.
c) Maximum Likelihood Estimation: A method of finding parameter values that, given a
set of observations, will maximize the likelihood function.
d) Bayesian Estimation Methods: A family of estimation methods that tries to minimize
the posterior expectation of a loss function (equivalently, to maximize the posterior expectation
of a utility function). In practice, what this means is that existing (prior) knowledge about a
situation is formulated, data is gathered, and the data is then used to update our beliefs into
posterior knowledge.
e) Method of Moments: A method of finding parameter values by equating the sample moments
to the corresponding moments of the distribution.
2) Tests of significance, or hypothesis testing, in which scientists make a claim about the
population by analyzing a statistical sample. By design, there is some uncertainty in this process.
This can be expressed in terms of a level of significance.
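The confidence interval described in item 1 can be sketched as follows. This is a minimal illustration only: it assumes a large sample, so that the normal approximation applies, and the summary values (n = 120, mean = 1670, s.d. = 80) and the function name are chosen purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Large-sample (normal-based) confidence interval for a population mean."""
    # z is the standard normal quantile leaving (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sample_sd / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Illustrative summary values (assumed, not taken from a real study)
lower, upper = mean_confidence_interval(1670, 80, 120)
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

A higher degree of confidence gives a wider interval, while a larger sample gives a narrower one.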
Techniques that social scientists use to examine the relationships between variables, and
thereby to create inferential statistics, include linear regression analyses, logistic regression
analyses, ANOVA, correlation analyses, structural equation modeling, and survival analysis.
When conducting research using inferential statistics, scientists conduct a test of significance to
determine whether they can generalize their results to a larger population. Common tests of
significance include the chi-square test and the t-test. These tell scientists how probable results
at least as extreme as those observed in the sample would be if the null hypothesis about the
population were true.
Bi-variate Data Analysis
Bi-variate data arise when you are studying two variables. For example, if you are studying a
group of college students to find out their average SAT score and their age, you have two pieces
of the puzzle to find (SAT score and age). Or if you want to find out the weights and heights of
college students, then you also have bi-variate data. Bi-variate data analysis includes scatter
plots, simple regression, simple correlation, ANOVA, etc.
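As a rough sketch of simple regression on bi-variate data, the least-squares line can be computed directly from the sums of squares and products. The heights and weights below are hypothetical values invented for illustration, and the function name is an assumption of this sketch.

```python
def simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products and sum of squares of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical heights (cm) and weights (kg) of five students
heights = [150, 160, 165, 172, 180]
weights = [52, 58, 63, 68, 77]
intercept, slope = simple_linear_regression(heights, weights)
print(f"weight = {intercept:.2f} + {slope:.3f} * height")
```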
Multivariate Data Analysis
Multivariate data arise when you are studying more than two variables. Multivariate
analysis is used to study more complex sets of data than univariate analysis methods can
handle. This type of analysis is almost always performed with software (e.g. SPSS, SAS, or
STATA), as working with even the smallest of data sets by hand can be overwhelming.
Multivariate data analysis includes the following methods:
1) Additive Tree.
2) Canonical Correlation Analysis.
3) Cluster Analysis.
4) Correspondence Analysis / Multiple Correspondence Analysis.
5) Factor Analysis.
6) Generalized Procrustes Analysis.
7) Independent Component Analysis.
8) MANOVA.
9) Multidimensional Scaling.
10) Multiple Regression Analysis.
11) Partial Least Squares Regression.
12) Principal Component Analysis.
13) Redundancy Analysis.
Test of Hypothesis/Significance
A test of hypothesis is a statistical procedure for arriving at a conclusion or decision on the
basis of a sample: it tests whether the formulated hypothesis should be rejected or accepted in a
probabilistic sense. The test is set up to judge whether the sample provides enough evidence to
reject the null hypothesis.
Steps or procedure of hypothesis testing:
1. Set up a hypothesis:
The first step in hypothesis testing is to establish the hypothesis to be tested.
Hypothesis:
A hypothesis is an assumption about a parameter of the population, or about the form of the
population, that is to be tested.
For example,
$\mu = \mu_0$ and $\sigma^2 = \sigma_0^2$ are two hypotheses,
where $\mu$ = population mean,
$\sigma^2$ = population variance,
$\mu_0$ = a specified value of the population mean,
$\sigma_0^2$ = a specified value of the population variance.
2. Choose the level of significance:
The probability of type I error, denoted by $\alpha$, is called the level of significance. The value
of $\alpha$ may be 1%, 2%, 5%, 10%, etc. The probability of type II error is denoted by $\beta$, and
$1 - \beta$ is called the power of the test.
3. Determination of a suitable test statistic:
The third step is to determine a suitable test statistic. The usual test statistics used in hypothesis
testing are the Z-statistic, t-statistic, F-statistic and $\chi^2$-statistic.
4. Determine the critical region/critical value:
It is important to specify, before the sample is taken, which values of the test statistic will lead to
rejection of $H_0$ and which will lead to acceptance of $H_0$. The former set of values is called the
critical region.
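(i) Test of significance of a single mean:
Null hypothesis, $H_0: \mu = \mu_0$
Alternative hypothesis, i. $H_A: \mu \neq \mu_0$
ii. $H_A: \mu > \mu_0$
iii. $H_A: \mu < \mu_0$
The required test statistic is given by
$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$, when $\sigma$ is known, or
$Z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, when $\sigma$ is unknown but estimated from a large sample ($n \geq 30$).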
Problem 1: A sample of 900 members has a mean of 3.5 cm and s.d. of 2.60 cm. Is the sample from
a large population with mean 3.25 cm and s.d. 2.60 cm?
Problem 2: The mean lifetime of a sample of 120 light tubes produced by a company is found to
be 1670 hours with standard deviation of 80 hours. Test the hypothesis that the mean lifetime of
the tubes produced by the company is 1650 hours.
Problem 3: A manufacturer of bolts claims that the mean length of bolts is 5.5 inches with a
standard deviation of 0.02 inch. A random sample of 42 bolts yields a mean of 5.71 inches. Do
these data provide sufficient evidence to indicate that the true mean length is equal to the mean
length claimed by the manufacturer?
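As an illustration of the large-sample test of a single mean, the following sketch computes the Z statistic for the figures in Problem 2 (n = 120, sample mean 1670 hours, s = 80 hours, hypothesized mean 1650 hours). The function name and the choice of a two-sided alternative are assumptions made for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, mu0, sd, n):
    """Z statistic and two-sided p-value for H0: mu = mu0 (large sample)."""
    z = (sample_mean - mu0) / (sd / sqrt(n))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Figures from Problem 2; s is used in place of the unknown sigma (n >= 30)
z, p = one_sample_z(sample_mean=1670, mu0=1650, sd=80, n=120)
print(f"Z = {z:.3f}, two-sided p-value = {p:.4f}")
if abs(z) > 1.96:  # 1.96 is the two-sided 5% critical value of Z
    print("Reject H0 at the 5% level of significance")
else:
    print("Do not reject H0 at the 5% level of significance")
```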
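(ii) Test of significance of difference of means:
Null hypothesis, $H_0: \mu_1 = \mu_2$
Alternative hypothesis, i. $H_1: \mu_1 \neq \mu_2$
ii. $H_1: \mu_1 > \mu_2$
iii. $H_1: \mu_1 < \mu_2$
To test the above null hypothesis the required test statistic is given by
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$, when $\sigma_1^2$ and $\sigma_2^2$ are known, or
$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, when $\sigma_1^2$ and $\sigma_2^2$ are not known but estimated from large samples.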
Problem 4:
You are given the following information relating to the purchase of bulbs from two manufacturers,
A and B:
Manufacturer   No. of bulbs bought   Mean life    S.D.
A              100                   2950 hrs.    100 hrs.
B              100                   2970 hrs.    90 hrs.
Is there a significant difference in the mean life of the two makes of bulbs?
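The test of difference of means can be sketched on the Problem 4 figures as follows. The sample standard deviations stand in for the unknown population values, which the large samples justify; the function name and the two-sided alternative are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic and two-sided p-value for H0: mu1 = mu2 (large samples)."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / standard_error
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Manufacturer A: n = 100, mean 2950 h, s.d. 100 h
# Manufacturer B: n = 100, mean 2970 h, s.d. 90 h
z, p = two_sample_z(2950, 100, 100, 2970, 90, 100)
print(f"Z = {z:.3f}, two-sided p-value = {p:.3f}")
```

Here |Z| is below 1.96, so under these assumptions the difference would not be judged significant at the 5% level.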
Problem 5:
I.Q. test on two groups of boys and girls gave the following results:
Girls: $\bar{x} = 78$, s.d. = 10, n = 40
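(iii) Test of significance of a single proportion:
Null hypothesis, $H_0: \pi = \pi_0$ (where $\pi$ is the population proportion)
Alternative hypothesis: $H_1: \pi \neq \pi_0$
To test the above null hypothesis the required test statistic is given by
$Z = \frac{P - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$, where $P = \frac{x}{n}$ is the sample proportion.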
Problem 6: 20 people were attacked by a disease and only 18 survived. Will you reject the
hypothesis that the survival rate, if attacked by this disease, is 85% in favor of the hypothesis
that it is higher, at the 5% level of significance?
Problem 7: A random sample of 100 seeds was taken from a large consignment for examination
and 15 were found to be defective. Can we accept the supplier's claim that the proportion of bad
seeds in the consignment is 0.03?
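A minimal sketch of the single-proportion test, applied to the Problem 6 figures (18 survivors out of 20, hypothesized survival rate 0.85, one-sided alternative that the rate is higher). The function name is hypothetical, and with only 20 observations the normal approximation is rough, so the result should be read as an illustration of the formula rather than a definitive analysis.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(successes, n, pi0):
    """Z statistic and upper-tail p-value for H0: pi = pi0 against H1: pi > pi0."""
    p_hat = successes / n
    z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)
    p_upper = 1 - NormalDist().cdf(z)
    return z, p_upper


z, p_upper = one_proportion_z(successes=18, n=20, pi0=0.85)
print(f"Z = {z:.3f}, one-sided p-value = {p_upper:.3f}")
# 1.645 is the one-sided 5% critical value of Z
print("Reject H0" if z > 1.645 else "Do not reject H0")
```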
(iv) Test of significance of difference of proportions:
Null hypothesis, $H_0: \pi_1 = \pi_2$ (where $\pi$ is the proportion)
Alternative hypothesis, $H_1: \pi_1 \neq \pi_2$
To test the above null hypothesis the required test statistic is given by
$Z = \frac{P_1 - P_2}{\sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $P_1 = \frac{x_1}{n_1}$, $P_2 = \frac{x_2}{n_2}$, $p = \frac{n_1 P_1 + n_2 P_2}{n_1 + n_2}$ and $q = 1 - p$.
Problem 8: Before an increase in excise duty on tea, 800 persons out of a sample of 1000
persons were found to be tea drinkers. After an increase in duty, 800 people were tea drinkers in
a sample of 1300 people. State whether there is a significant decrease in the consumption of tea
after the increase in excise duty?
Problem 9: In a year there were 956 births in town A, of which 52.6% were males, while in
towns A and B combined this proportion, in a total of 1406 births, was 0.495. Is there any
significant difference in the proportion of male births in the two towns?
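The pooled-proportion statistic can be sketched on the Problem 8 figures (800 tea drinkers out of 1000 before the duty increase, 800 out of 1300 after). The function name is an assumption, and the one-sided reading (a decrease after the increase in duty) follows the wording of the problem.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: pi1 = pi2 using the pooled proportion p."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled estimate, equal to (n1*p1 + n2*p2) / (n1 + n2)
    q = 1 - p
    return (p1 - p2) / sqrt(p * q * (1 / n1 + 1 / n2))


z = two_proportion_z(800, 1000, 800, 1300)
p_upper = 1 - NormalDist().cdf(z)  # one-sided: proportion before > proportion after
print(f"Z = {z:.2f}, one-sided p-value = {p_upper:.4g}")
```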
sample $n \geq 30$. We may have to face some situations where the sample is not large enough and
the standard error of U is not known. In such cases an estimate of the standard error of U can be
obtained and the test statistic becomes
$t = \frac{U - E(U)}{\text{estimated S.E.}(U)}$,
which is distributed as Student's t with $\nu$ degrees of freedom, where $\nu$ is less than n.
Applications of t-test:
i) Test of significance for single mean
ii) Test of significance of difference of means
iii) Test of significance for sample correlation coefficient
iv) Test of significance of difference of means from correlated populations
v) Test of significance of observed regression coefficient
vi) Test of significance of difference of regression coefficients
vii) Test of significance for observed partial correlation coefficient
Test of single population mean:
We want to test the null hypothesis, $H_0: \mu = \mu_0$
Alternative hypothesis, i. $H_1: \mu \neq \mu_0$
ii. $H_1: \mu > \mu_0$
iii. $H_1: \mu < \mu_0$
To test the above null hypothesis, the required test statistic will be
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, which is distributed as Student's t with $\nu = n - 1$ degrees of freedom.
Problem 1:
The weekly wages of 10 workers taken at random from a factory are given below:
Wages (Tk.): 578, 572, 570, 568, 572, 578, 570, 572, 596, 584
Is it possible that the mean wage of all workers of this factory is Tk. 580?
Problem 2:
The following are the weights (in lbs) of a random sample of 10 employees working in the
shipping department of a wholesale grocery firm: 154, 154, 186, 243, 159, 174, 183, 163, 192, 281.
On the basis of this data, can we conclude at the 0.05 significance level that the firm's shipping
department employees have a mean weight of 160 lbs?
Problem 3:
The daily sales (in Tk.) of a shop in a market are given below:
500 645 2100 1950 1715 895 1230 795 899
1230 1315 865 980 810 1815 1900 1400 1150
1125 1050 950 1150 1925 890 1600 1625 1750
Test the hypothesis that the average sale of the shop is Tk. 1400 on the basis of the above sample.
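As a sketch of the single-mean t-test, the statistic for the wage data of Problem 1 (hypothesized mean Tk. 580) can be computed as below. The function name is an assumption; the critical value 2.262 is the tabulated two-sided 5% value of t with 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """t statistic for H0: mu = mu0, with its degrees of freedom (n - 1)."""
    n = len(sample)
    # stdev() uses the divisor n - 1, matching s in the formula above
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1


wages = [578, 572, 570, 568, 572, 578, 570, 572, 596, 584]
t_stat, df = one_sample_t(wages, mu0=580)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print("Reject H0" if abs(t_stat) > 2.262 else "Do not reject H0")
```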
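Test of significance of difference of means:
Null hypothesis, $H_0: \mu_1 = \mu_2$
Alternative hypothesis, i. $H_1: \mu_1 \neq \mu_2$
ii. $H_1: \mu_1 > \mu_2$
iii. $H_1: \mu_1 < \mu_2$
To test the above null hypothesis, the required test statistic will be
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}}}$, which is distributed as Student's t with $n_1 + n_2 - 2$ degrees of freedom.
Here, $s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$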
Problem 4:
Samples of two different types of bulbs were tested for length of life, and the following data
were obtained:
Type I (Hours): 1200, 1350, 1400, 1320, 1420, 1410, 1100, 1300
Type II (Hours): 1200, 1400, 1420, 1230, 1430, 1250, 1300, 1500, 1240
Is there a significant difference in the mean length of life of the two types of bulbs?
Problem 5:
Two types of batteries are tested for their length of life and the following data are obtained:
Type A (Hours): 450, 500, 600, 900, 850, 750, 930, 750, 800, 750, 630, 450, 620
Type B (Hours): 360, 530, 420, 620, 720, 420, 560, 850, 650, 450, 610
Is there a significant difference in the two means?
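The pooled-variance t statistic can be sketched on the bulb data of Problem 4. The function name is hypothetical, and the sketch assumes, as the pooled formula does, that the two populations have a common variance.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """Pooled-variance t statistic for H0: mu1 = mu2, with n1 + n2 - 2 d.f."""
    n1, n2 = len(sample1), len(sample2)
    # Pooled variance: s^2 = [(n1 - 1) s1^2 + (n2 - 1) s2^2] / (n1 + n2 - 2)
    s2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    t = (mean(sample1) - mean(sample2)) / sqrt(s2 / n1 + s2 / n2)
    return t, n1 + n2 - 2


type_1 = [1200, 1350, 1400, 1320, 1420, 1410, 1100, 1300]
type_2 = [1200, 1400, 1420, 1230, 1430, 1250, 1300, 1500, 1240]
t_stat, df = pooled_t(type_1, type_2)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
# 2.131 is the tabulated two-sided 5% critical value of t with 15 d.f.
print("Reject H0" if abs(t_stat) > 2.131 else "Do not reject H0")
```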
Problem 6:
An equal opportunities committee is conducting an investigation if in comparable jobs; men and
women workers are paid identical wages. The following information is obtained on 15 males and
18 females:
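Test of significance for sample correlation coefficient:
Null hypothesis, $H_0: \rho = 0$
Alternative hypothesis, i. $H_1: \rho \neq 0$
ii. $H_1: \rho > 0$
iii. $H_1: \rho < 0$
To test the above null hypothesis, the required test statistic is given by
$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$, where r is the sample correlation coefficient,
which is distributed as Student's t with (n-2) d.f.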
Problem 7:
In a study of the relationship between expenditure (X) and annual sales volume (Y), a sample of
10 firms yielded the coefficient of correlation r = 0.93. Can we conclude on the basis of this data
that X and Y are linearly correlated?
Problem 8:
The following figures relate to the age and blood pressure of 10 women.
Age:             56   42   36   47   49   42   60   72   63   55
Blood pressure:  149  125  118  128  145  140  155  160  149  150
Can we conclude on the basis of the above data that age and blood pressure are linearly correlated?
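The t statistic for a sample correlation coefficient can be sketched with the Problem 7 figures (r = 0.93, n = 10). The function name is an assumption; 2.306 is the tabulated two-sided 5% critical value of t with 8 degrees of freedom.

```python
from math import sqrt


def correlation_t(r, n):
    """t statistic for H0: rho = 0, distributed as Student's t with n - 2 d.f."""
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return t, n - 2


t_stat, df = correlation_t(r=0.93, n=10)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print("Reject H0" if abs(t_stat) > 2.306 else "Do not reject H0")
```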
$\chi^2$-test
Karl Pearson first used it in 1900. The $\chi^2$ test is one of the simplest and most widely used
non-parametric tests in statistical work. It makes no assumption about the population being sampled.
The quantity $\chi^2$ describes the magnitude of the discrepancy between theory and observation. If
$\chi^2$ is zero, it means that the observed and expected frequencies completely coincide. The greater
the value of $\chi^2$, the greater the discrepancy between observed and expected frequencies.
$\chi^2 = \sum \frac{(O - E)^2}{E}$, where O = observed frequency, E = expected frequency.
Degrees of freedom:
The number of degrees of freedom is described as the number of observations that are free to vary
after certain restrictions have been imposed on the data.
Applications of the $\chi^2$-test:
i) To test a single variance
ii) To test goodness of fit
iii) To test independence of attributes
iv) To test the significance of equality of several variances
v) To test the significance of equality of several proportions
vi) To test the significance of equality of several correlation coefficients
Assumptions for the application of the $\chi^2$ test:
The following conditions must be met in order for chi-square analysis to be applied.
i. The experimental data (sample observations) must be independent.
ii. The sample data must be drawn at random from the target population.
iii. The data should be expressed in original units for convenience of comparison, not in
percentage or ratio form.
iv. The sample should contain at least 50 observations.
v. There should not be fewer than five observations in any one cell (each data entry is
known as a cell).
vi. The constraint on the cell frequencies is $\sum O = \sum E$.
Single variance test:
Let us suppose that we have a random sample of size n consisting of $x_1, x_2, x_3, \ldots, x_n$ drawn
from a normal population. We want to test the null hypothesis that the population variance is $\sigma_0^2$,
i.e. null hypothesis, $H_0: \sigma^2 = \sigma_0^2$, where $\sigma_0^2$ is a specified value of $\sigma^2$.
Alternative hypothesis, i. $H_1: \sigma^2 \neq \sigma_0^2$
ii. $H_1: \sigma^2 > \sigma_0^2$
iii. $H_1: \sigma^2 < \sigma_0^2$
To test the above null hypothesis, the required test statistic is given by
$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2}$, which is distributed as $\chi^2$ with (n-1) degrees of freedom.
For the two-sided alternative, if $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$ (upper-tail critical values), we reject the null hypothesis; otherwise we accept the null hypothesis.
Problem 1:
Weights in kilograms of 10 shipments are given below: 38, 40, 45, 53, 47, 43, 55, 48, 52, 49.
Can we say that the variance of the distribution of weights of all shipments from which the above
sample of 10 shipments was drawn is equal to 20 square kilograms?
Problem 2:
Prices of shares of a company on different days in a month were found to be
66, 65, 69, 70, 69, 71, 70, 63, 64 and 68. Test whether the variance of the shares in the month is 9.
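A short sketch of the single variance test on the shipment weights of Problem 1 (hypothesized variance 20 square kilograms). The function name is an assumption; 2.700 and 19.023 are the tabulated lower and upper 2.5% points of $\chi^2$ with 9 degrees of freedom, for a two-sided test at the 5% level.

```python
from statistics import variance


def variance_chi_square(sample, sigma0_sq):
    """Chi-square statistic (n - 1) s^2 / sigma0^2 for H0: sigma^2 = sigma0^2."""
    n = len(sample)
    chi2 = (n - 1) * variance(sample) / sigma0_sq
    return chi2, n - 1


weights = [38, 40, 45, 53, 47, 43, 55, 48, 52, 49]
chi2, df = variance_chi_square(weights, sigma0_sq=20)
print(f"chi-square = {chi2:.2f} with {df} degrees of freedom")
print("Reject H0" if chi2 < 2.700 or chi2 > 19.023 else "Do not reject H0")
```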
Test for independence of attributes:
One of the most frequent uses of $\chi^2$ is for testing the null hypothesis that two criteria of
classification are independent. They are independent if the distribution of one criterion in no way
depends on the distribution of the other criterion. If they are not independent, there is an
association between the criteria.
Let us designate the two attributes as A and B, where attribute A is assumed to have r
categories and attribute B is assumed to have c categories. Furthermore, assume the total
number of observations in the problem is N.
A \ B    B1    B2    ...   Bj    ...   Bc    Total
A1       O11   O12   ...   O1j   ...   O1c   R1
A2       O21   O22   ...   O2j   ...   O2c   R2
...      ...   ...   ...   ...   ...   ...   ...
Ar       Or1   Or2   ...   Orj   ...   Orc   Rr
Total    C1    C2    ...   Cj    ...   Cc    N
Null hypothesis, $H_0$: attributes A and B are independent.
To test the above null hypothesis, the required test statistic is given by
$\chi^2 = \sum_i \sum_j \frac{O_{ij}^2}{E_{ij}} - N$, where $O_{ij}$ = observed frequency in the ith row and jth column,
$E_{ij}$ = expected frequency = $\frac{R_i C_j}{N}$, $R_i$ is the row total and $C_j$ is the column total,
which follows the $\chi^2$ distribution with (r-1)(c-1) d.f.
If the calculated value of $\chi^2$ is greater than or equal to the critical value, then the null
hypothesis will be rejected.
Contingency Table:
A contingency table, sometimes called a two-way frequency table, cross tabulation or
crosstab, is a tabular mechanism with at least two rows and two columns used in statistics to
present categorical data in terms of frequency counts. More precisely, an r×c contingency table
shows the observed frequencies of two variables, arranged into r rows and c columns. The
intersection of a row and a column of a contingency table is called a cell. Problem 1 is an
example of a 2×2 contingency table.
Problem 1: A sample of 200 people with a particular disease was selected. Out of these, 100
were given a drug and the others were not given any drug. The results are as follows:
Drug No drug Total
Cured 65 55 120
Not cured 35 45 80
Total 100 100 200
Test whether the drug is effective or not.
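The test of independence can be sketched on the 2×2 table of Problem 1. The function name is hypothetical; the sketch uses the deviation form of the statistic, $\sum (O - E)^2 / E$, and 3.841 is the tabulated 5% critical value of $\chi^2$ with 1 degree of freedom.

```python
def chi_square_independence(table):
    """Chi-square statistic and d.f. for an r x c table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E_ij = R_i * C_j / N
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows: cured, not cured; columns: drug, no drug
observed = [[65, 55], [35, 45]]
chi2, df = chi_square_independence(observed)
print(f"chi-square = {chi2:.3f} with {df} degree(s) of freedom")
print("Reject H0" if chi2 >= 3.841 else "Do not reject H0")
```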
Problem 2:
The following contingency table shows the classification of 2000 workers in a factory, according
to the disciplinary action taken by the management and promotional experience:
                       Promotional experience
Disciplinary action    Promoted    Not promoted
Non-offenders          146         462
Offenders              54          1338
Test whether the disciplinary action taken and promotional experience are independent.
Problem 3:
Based on information on 1100 randomly selected fields about the tenancy status of the
cultivators of these fields and use of fertilizers collected in an agro-economic enquiry, the
following classification was noted:
                       Owned    Rented
Using fertilizer       516      184
Not using fertilizer   64       336
Would you conclude that owner-cultivators are more inclined towards the use of fertilizers?
Problem 4:
Prices of a basket of goods and services showed the following trends in up-country and mid-town
markets:
                     Up-country trend
Mid-town trend       Increasing    Not increasing
Increasing           60            31
Not increasing       15            5
Test whether the trends in up-country prices and in mid-town prices have any significant association.
F-test
The F-test is based on the F-distribution, which was named in honor of R. A. Fisher, who first
introduced it in 1924. The F-distribution is usually defined in terms of the ratio of the variances
of two normally distributed populations. Therefore the F statistic is defined as
$F = \frac{s_1^2}{s_2^2}$, which follows the F-distribution with ($n_1 - 1$) and ($n_2 - 1$) degrees of freedom.
Applications of F-test:
(i) Test of significance for equality of two population variances
(ii) Test of significance for homogeneity of population means
(iii) Test of significance of an observed correlation ratio
(iv) Test of significance of linearity of regression
(v) Test of significance of an observed multiple correlation coefficient.
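Testing the hypothesis for equality of two variances:
We want to test the null hypothesis, $H_0: \sigma_1^2 = \sigma_2^2$
Alternative hypothesis, $H_1$: i. $\sigma_1^2 \neq \sigma_2^2$,
ii. $\sigma_1^2 > \sigma_2^2$
iii. $\sigma_1^2 < \sigma_2^2$
To test the above null hypothesis the required test statistic is given by
$F = \frac{s_1^2}{s_2^2}$ (when $s_1^2 > s_2^2$), which follows the F-distribution with ($n_1 - 1$) and ($n_2 - 1$) degrees of
freedom, or $F = \frac{s_2^2}{s_1^2}$ (when $s_2^2 > s_1^2$), which follows the F-distribution with ($n_2 - 1$) and ($n_1 - 1$) degrees of
freedom.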
Problem 1:
Two sources of raw materials are under consideration by a company. Both sources seem to
have similar characteristics, but the company is not sure about their respective uniformity. A
sample of 10 lots from source A yields a variance of 225 and a sample of 11 lots from source
B yields a variance of 200. Is it likely that the variance of source A is significantly greater than
the variance of source B?
Problem 2: A sample of the monthly earnings records of 15 employees of company A has a
variance of Tk. 15.90 while a similar sample of 27 employees for company B has a variance of
Tk. 17.50. Is it safe to assume that there is less variance in company A than in company B?
Problem 3:
Samples of two different types of bulbs were tested for length of life, and the following data
were obtained:
Type I (Hours): 1200, 1350, 1400, 1320, 1420, 1410, 1100, 1300
Type II (Hours): 1200, 1400, 1420, 1230, 1430, 1250, 1300, 1500, 1240
Is there a significant difference in the variation of lifetime of the two types of bulbs?
Problem 4:
Two types of batteries are tested for their length of life and the following data are obtained:
Type A (Hours): 450, 500, 600, 900, 850, 750, 930, 750, 800, 750, 630, 450, 620
Type B (Hours): 360, 530, 420, 620, 720, 420, 560, 850, 650, 450, 610
Is there a significant difference in the two variances of length of life?
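The variance-ratio statistic can be sketched on the bulb data of Problem 3. The function name is an assumption; following the convention above, the larger sample variance goes in the numerator, and the resulting F would then be compared with the tabulated critical value for the returned degrees of freedom.

```python
from statistics import variance


def variance_ratio_f(sample1, sample2):
    """F = larger sample variance / smaller, with (numerator, denominator) d.f."""
    v1, v2 = variance(sample1), variance(sample2)
    n1, n2 = len(sample1), len(sample2)
    if v1 >= v2:
        return v1 / v2, (n1 - 1, n2 - 1)
    return v2 / v1, (n2 - 1, n1 - 1)


type_1 = [1200, 1350, 1400, 1320, 1420, 1410, 1100, 1300]
type_2 = [1200, 1400, 1420, 1230, 1430, 1250, 1300, 1500, 1240]
f_stat, dfs = variance_ratio_f(type_1, type_2)
print(f"F = {f_stat:.3f} with {dfs} degrees of freedom")
```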
Analysis of variance:
Analysis of variance is a technique of partitioning the total sum of squared deviations of all
sample values from the grand mean into components of variation attributable to different
factors; it is used to test hypotheses by means of the F-test.
One-way classification:
When data obtained from a population can be classified and arranged on the basis of one
factor only, the data are referred to as a one-way classification.
In one-way classification,
Total variation = Sum of squares due to treatment + Sum of squares due to error.
Null hypothesis, $H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$
Alternative hypothesis, $H_1$: not all of $\mu_1, \mu_2, \ldots, \mu_k$ are equal
Analysis of variance table for one-way classification
Problem 5:
The following data represent the number of units of production per day turned out by 5 different
workers using 4 different types of machines:
           Machine type
Worker     A     B     C     D
1          48    36    48    38
2          48    40    50    44
3          37    38    40    36
4          43    34    45    32
5          40    47    51    42
i) Test whether the mean productivity is the same for the 4 different machine types.
ii) Test whether the 5 workers differ with respect to mean productivity.
Problem 6:
The following table gives monthly sales (in thousand rupees) of a certain firm in three States by
its four salesmen:
           Salesman
State      I     II    III   IV
A          10    7     6     8
B          8     9     6     5
C          11    7     8     8
i. Test whether there is any significant difference among the four salesmen.
ii. Test whether there is any significant difference among the three States.
Problem 7:
The following data give the number of refrigerators sold by 4 salesmen in three months:
           Salesman
Month      A     B     C     D
January    52    43    48    39
June       45    48    50    45
December   41    45    41    48
Determine whether there is any difference in the average sales made by the four salesmen.
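The one-way partition of total variation into treatment and error components can be sketched on the Problem 7 data, treating the four salesmen as treatments and the three months as replicates. The function name and the returned dictionary keys are assumptions of this sketch; the computed F would then be compared with the tabulated F value for the returned degrees of freedom.

```python
def one_way_anova(groups):
    """Partition total variation into treatment and error sums of squares."""
    all_values = [x for g in groups for x in g]
    n_total, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n_total
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    # Between-group (treatment) variation: group means around the grand mean
    ss_treatment = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) variation: observations around their own group mean
    ss_error = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    f_ratio = (ss_treatment / (k - 1)) / (ss_error / (n_total - k))
    return {"ss_total": ss_total, "ss_treatment": ss_treatment,
            "ss_error": ss_error, "F": f_ratio, "df": (k - 1, n_total - k)}


# Sales by salesman A, B, C, D in January, June and December
sales = [[52, 45, 41], [43, 48, 45], [48, 50, 41], [39, 45, 48]]
result = one_way_anova(sales)
print(f"SS treatment = {result['ss_treatment']:.3f}, SS error = {result['ss_error']:.3f}")
print(f"F = {result['F']:.4f} with {result['df']} degrees of freedom")
```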
Correlation Matrix:
A correlation matrix is a table showing correlation coefficients between sets of variables.
Each random variable (Xi) in the table is correlated with each of the other values in the table (Xj).
This allows you to see which pairs have the highest correlation.
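As a small sketch, a correlation matrix can be built by computing the Pearson correlation coefficient for every pair of variables. The helper names are hypothetical, and the age and blood pressure figures from Problem 8 of the t-test section are reused as example data, so the matrix here is only 2×2.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return {name_i: {name_j: r_ij}} for a dict of named variables."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


data = {
    "age": [56, 42, 36, 47, 49, 42, 60, 72, 63, 55],
    "blood_pressure": [149, 125, 118, 128, 145, 140, 155, 160, 149, 150],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

The diagonal entries equal 1 (each variable with itself) and the matrix is symmetric, so only the entries above or below the diagonal need to be inspected.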