
Processing & Analysis of Data

1. Data processing involves editing, coding, classifying, and tabulating raw data to prepare it for analysis. This includes activities like data cleaning, assigning numeric codes to responses, and organizing the data into tables. 2. Coding assigns numeric codes or symbols to questionnaire responses to facilitate analysis. A codebook documents the coding instructions and variable information. 3. Classification groups the data into homogeneous categories based on common attributes or numerical ranges. 4. Tabulation systematically arranges the classified data into rows and columns for comparison and statistical analysis. Different types of tabulation exist depending on how many variables are considered.
Processing & Analysis of Data

Dr. Md. Musa Khan


Associate Professor
DBA, IIUC
Processing of data:
The data, after collection, has to be processed and analysed in accordance with the outline
laid down for the purpose at the time of developing the research plan. This is essential for a
scientific study and for ensuring that we have all relevant data for making contemplated
comparisons and analysis. Technically speaking, processing implies editing, coding,
classification and tabulation of collected data so that they are amenable to analysis.
Data Processing Operations:
1. Editing:
Editing is the process of making data ready for coding and transfer to data storage. Its
purpose is to ensure completeness, consistency and reliability. Editing takes two forms: field
editing and in-house editing.
Field Editing:
Field editing is preliminary editing by a field supervisor on the same day the interview is
conducted. Its purpose is to catch technical omissions, check handwriting and clarify responses.
The daily field edit allows the respondents to fill in omissions. It also indicates the need for
further training of interviewers. It does not occur with mail surveys.
In-house editing/ central editing
In-house editing is a rigorous editing job performed by centralized office staff. In-house
editors check the completeness of entries on each questionnaire and perform other editing
activities.
2. Coding
Coding is the process of identifying and classifying each answer with a numerical score or
other symbol. Code is a rule used for interpreting, classifying and recording data in the coding
process. Field is a collection of characters that represent a single type of data. Record is a
collection of related fields. File is a collection of related records. Data matrix is a rectangular
arrangement of data into rows and columns. Code construction is necessary for questionnaire
preparation of opinion and other socio-economic surveys.
Example: "Self-regulation by business itself is preferable to government control of business."
Strongly agree = 5
Agree = 4
Neither agree nor disagree = 3
Disagree = 2
Strongly disagree = 1
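Such a coding scheme can be sketched as a simple lookup table. The Python below is not part of these notes; it only illustrates the conventional 5 (strongly agree) to 1 (strongly disagree) ordering, and the variable names are illustrative:

```python
# Likert coding scheme as a mapping from response label to numeric code
# (conventional 5-to-1 ordering; names are illustrative).
likert_codes = {
    "Strongly agree": 5,
    "Agree": 4,
    "Neither agree nor disagree": 3,
    "Disagree": 2,
    "Strongly disagree": 1,
}

# Coding a few raw questionnaire responses:
responses = ["Agree", "Strongly agree", "Disagree"]
coded = [likert_codes[r] for r in responses]
print(coded)  # [4, 5, 2]
```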
Coding Questions:
The respondent code and the record number should appear on each record in the data. If the
questionnaire contains unstructured questions, codes are assigned after the questionnaires have
been returned from the field; this is called post-coding. Coding of structured questions is
relatively simple, because the response options are predetermined; this is called pre-coding. The
researcher assigns a code for each response to each question and specifies the appropriate record
and columns in which the response codes are to appear.
Example: (Single response)
Do you have a currently valid passport?
1. Yes 2. No (1/54)
For this question, a “Yes” response is coded 1 and a “No” response 2. The numbers in
parentheses indicate that the code assigned will appear on the first record for this respondent
in column 54.
Example: (Multiple responses)
Which account do you now have at this bank? (“X” as many as apply)
Regular savings account (162)
Regular savings account (163)
Mortgage (X) (164)
NOW account (X) (165)
Club account (X) (166)
Line of credit (X) (167)
Term savings account (X) (168)
Savings bank life insurance (X) (169)
Home improvement loan (X) (170)
Auto loan (X) (171)
Other services (X) (172)

Code-book:


A code-book contains coding instructions and the necessary information about variables in
data set. It guides the coders in their works and helps the researcher to properly identify and
locate the variables. Even if the questionnaire has been pre-coded, it is helpful to prepare a
formal code-book. A code-book generally contains the information such as (1) column number
(2) record number (3) variable number (4) variable name (5) question number (6) instructions for
coding.
3. Classification:
Most research studies result in a large volume of raw data which must be reduced into
homogeneous groups if we are to get meaningful relationships. This fact necessitates
classification of data which happens to be the process of arranging data in groups or classes on
the basis of common characteristics. Data having a common characteristic are placed in one class
and in this way the entire data get divided into a number of groups or classes. Classification can
be one of the following two types, depending upon the nature of the phenomenon involved:
a) Classification according to attributes: As stated above, data are classified on the basis of
common characteristics which can either be descriptive (such as literacy, sex, honesty, etc.)
or numerical (such as weight, height, income, etc.). Descriptive characteristics refer to
qualitative phenomenon which cannot be measured quantitatively; only their presence or
absence in an individual item can be noticed. Data obtained this way on the basis of certain
attributes are known as statistics of attributes and their classification is said to be
classification according to attributes.
b) Classification according to class-intervals: Unlike descriptive characteristics, the numerical
characteristics refer to quantitative phenomenon which can be measured through some
statistical units. Data relating to income, production, age, weight, etc. come under this
category. Such data are known as statistics of variables and are classified on the basis of
class interval.
4. Tabulation:
Tabulation is a systematic & logical presentation of numeric data in rows and columns to
facilitate comparison and statistical analysis. It facilitates comparison by bringing related
information close to each other and helps in further statistical analysis and interpretation.
Tabulation is essential because of the following reasons:
1. It conserves space and reduces explanatory and descriptive statement to a minimum.
2. It facilitates the process of comparison.
3. It facilitates the summation of items and the detection of errors and omissions.
4. It provides a basis for various statistical computations.
Types of Tabulation:
(1) Simple Tabulation or One-way Tabulation:
When the data are tabulated according to one characteristic, it is said to be a simple
tabulation or one-way tabulation.
For example: Tabulation of data on the population of the world classified by one
characteristic like religion is an example of a simple tabulation.
(2) Double Tabulation or Two-way Tabulation:
When the data are tabulated according to two characteristics at a time, it is said to be a
double tabulation or two-way tabulation.
For example: Tabulation of data on the population of the world classified by two
characteristics like religion and sex is an example of a double tabulation.
(3) Complex Tabulation :
When the data are tabulated according to many characteristics, it is said to be a complex
tabulation.
For example: Tabulation of data on the population of the world classified by three or more
characteristics like religion, sex and literacy, etc. is an example of a complex tabulation.
Cross-tabulation:
A statistical technique that describes two or more variables simultaneously and results in
tables that reflect the joint distribution of two or more variables that have a limited number of
categories or distinct values. Cross-tabulation tables are also called contingency tables. Cross-
tabulation with two variables is known as bi-variate cross-tabulation.
Example:
Internet Usage        Sex                  Row total
                      Male      Female
Light                 5         10         15
Heavy                 10        5          15
Column total          15        15         30
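The joint distribution in the example table can be built from raw records with a frequency count. This Python sketch is not part of the original notes; the records are fabricated only to reproduce the cell counts shown above:

```python
from collections import Counter

# Hypothetical raw records reproducing the example:
# 15 light users (5 male, 10 female) and 15 heavy users (10 male, 5 female).
records = ([("Light", "Male")] * 5 + [("Light", "Female")] * 10
           + [("Heavy", "Male")] * 10 + [("Heavy", "Female")] * 5)

# Each cell of the cross-tabulation is a joint frequency count.
cells = Counter(records)
for usage in ("Light", "Heavy"):
    row = [cells[(usage, sex)] for sex in ("Male", "Female")]
    print(usage, row, "row total:", sum(row))
```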
Data analysis:
Data analysis is defined as a process of cleaning, transforming, and modeling data to
discover useful information for decision-making. The purpose of data analysis is to extract
useful information from the data and to take decisions based on that analysis.
Analysis, particularly in case of survey or experimental data, involves estimating the values
of unknown parameters of the population and testing of hypotheses for drawing inferences.
Analysis may, therefore, be categorized as descriptive analysis and inferential analysis or
statistical analysis.
Data in statistical analysis consist of one or more variables: the data may be uni-variate,
bi-variate or multivariate. Depending upon the number of variables, the researcher performs
different statistical techniques. If the data involve several variables, multivariate methods such
as factor analysis and discriminant analysis can be performed. If only a single variable is
involved, uni-variate techniques are used, such as the t-test for significance, the z-test, the
F-test and one-way ANOVA.
Descriptive Analysis:
Descriptive statistics looks for general characteristics and patterns in a collected sample of
real-world data and describes them. Descriptive analysis is largely the study of distributions of
one variable.
Uni-variate Data Analysis/One Variable Analysis/ Descriptive Statistics
Uni-variate analysis is the simplest form of analyzing data. “Uni” means “one”, so in
other words your data has only one variable. It doesn’t deal with causes or relationships (unlike
regression) and its major purpose is to describe; it takes data, summarizes that data and finds
patterns in the data.

There are a number of items that belong in this portion of statistics, such as:

A) Statistical Charts and Diagrams: Frequency distribution, line graph, bar diagram,
histogram, frequency curve, frequency polygon, ogive, pictogram, cartogram, etc.
B) Measures of Central Tendency: The average, or measure of the center of a data set,
consisting of the mean, median, mode, mid-range, quartile, decile and percentile.
C) Measures of Variation: The spread of a data set, which can be measured with the range,
inter-quartile range, variance, standard deviation, mean deviation and coefficient of variation.
D) Shape of the Distribution: Measurements such as skewness and kurtosis.
These measures are important and useful because they allow scientists to see patterns among data,
and thus to make sense of that data. Descriptive statistics can only be used to describe the
population or data set under study. The results cannot be generalized to any other group or
population.
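The measures listed above can be computed with Python's standard `statistics` module. This sketch is not part of the original notes; the data set is invented for illustration, and the skewness coefficient is computed by hand since the stdlib does not provide one:

```python
import statistics as st

# Hypothetical sample data used only to illustrate the measures above.
data = [4, 8, 6, 5, 3, 7, 8, 9, 4, 6]

mean = st.mean(data)           # central tendency
median = st.median(data)
stdev = st.stdev(data)         # variation (sample standard deviation, n - 1 divisor)
rng = max(data) - min(data)    # range

# Shape: a simple moment-based skewness coefficient (zero for a symmetric sample).
n = len(data)
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
skewness = m3 / m2 ** 1.5

print(mean, median, stdev, rng, skewness)
```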
Inferential analysis/Statistical analysis:
Inferential analysis is concerned with the various tests of significance for testing hypotheses
in order to determine with what validity data can be said to indicate some conclusion or
conclusions. It is also concerned with the estimation of population values. It is mainly on the
basis of inferential analysis that the task of interpretation (i.e., the task of drawing inferences and
conclusions) is performed. Scientists use inferential statistics to examine the relationships
between variables within a sample and then make generalizations or predictions about how those
variables will relate to a larger population.
It is usually impossible to examine each member of the population individually. So
scientists choose a representative subset of the population, called a statistical sample, and from
this analysis, they are able to say something about the population from which the sample came.

There are two major divisions of inferential statistics:

1) A confidence interval gives a range of values for an unknown parameter of the population by
measuring a statistical sample. This is expressed in terms of an interval and the degree of
confidence that the parameter is within the interval.
Methods of Parameter Estimation:
a) Probability Plotting: A method of finding parameter values in which the data are plotted on
special plotting paper and the parameters are derived from the visual plot.
b) Least Squares Method: A method of finding parameter values that minimizes the sum of
the squares of the residuals.
c) Maximum Likelihood Estimation: A method of finding parameter values that, given a
set of observations, will maximize the likelihood function.
d) Bayesian Estimation Methods: A family of estimation methods that minimizes the posterior
expectation of a loss function. In practice, this means that existing knowledge about a situation
is formulated, data are gathered, and the posterior knowledge is then used to update our beliefs.
e) Method of Moments: A method that estimates parameters by equating sample moments to the
corresponding moments of the distribution.

2) Tests of significance, or hypothesis testing, in which scientists make a claim about the
population by analyzing a statistical sample. By design, there is some uncertainty in this
process, which is expressed in terms of a level of significance.

Techniques that social scientists use to examine the relationships between variables, and
thereby to create inferential statistics, include linear regression analyses, logistic regression
analyses, ANOVA, correlation analyses, structural equation modeling, and survival analysis.
When conducting research using inferential statistics, scientists conduct a test of significance to
determine whether they can generalize their results to a larger population. Common tests of
significance include the chi-square and t-test. These tell scientists the probability that the results
of their analysis of the sample are representative of the population as a whole.
Bi-variate Data Analysis
Bi-variate data arise when you are studying two variables. For example, if you are studying a
group of college students to find out their average SAT score and their age, you have two pieces
of the puzzle to find (SAT score and age). Likewise, if you want to find out the weights and
heights of college students, you have bi-variate data. Bi-variate data analysis includes scatter
plots, simple regression, simple correlation, ANOVA, etc.
Multivariate Data Analysis
Multivariate data arise when you are studying more than two variables. Multivariate
analysis is used to study more complex sets of data than what univariate analysis methods can
handle. This type of analysis is almost always performed with software (e.g. SPSS, SAS or
STATA), as working with even the smallest of data sets can be overwhelming by hand.
Multivariate data analysis includes the following methods:
1) Additive Tree.
2) Canonical Correlation Analysis.
3) Cluster Analysis.
4) Correspondence Analysis / Multiple Correspondence Analysis.
5) Factor Analysis.
6) Generalized Procrustean Analysis.
7) Independent Component Analysis.
8) MANOVA.
9) Multidimensional Scaling.
10) Multiple Regression Analysis.
11) Partial Least Square Regression.
12) Principal Component Analysis
13) Redundancy Analysis.

Test of Hypothesis/Significance:
A test of hypothesis is a statistical procedure for arriving at a conclusion or decision on the
basis of samples: it tests, in a probability sense, whether the formulated hypothesis should be
rejected or accepted. The main aim of a test of hypothesis is to decide whether the null
hypothesis can be rejected.
Steps or procedure of hypothesis testing:
1. Set up a hypothesis:
The first step in hypothesis testing is to establish the hypothesis to be tested.
Hypothesis:
A hypothesis is an assumption, to be tested, about a parameter of the population or about the
form of a population.
For example, μ = μ0 and σ² = σ0² are two hypotheses,
where μ = population mean,
σ² = population variance,
μ0 = a specified value of the population mean,
σ0² = a specified value of the population variance.


Null hypothesis:
The hypothesis which we are going to test for possible rejection, under the assumption that it
is true, is called the null hypothesis; equivalently, a hypothesis which states that there is no
true difference between the assumed and actual value of the parameter is called the null
hypothesis. It is denoted by H0.
Alternative hypothesis:
The hypothesis which is complementary to the null hypothesis is called alternative
hypothesis and it is denoted by “H1” or “HA”.
2. Set up a suitable significance level:
Having set up a hypothesis the next step is to select a suitable level of significance.
Type I error: The error of rejecting null hypothesis when it is true is called type I error.
Type II error: The error of accepting null hypothesis when it is false is called type II error.

Decision        H0 True             H0 False
Accept H0       Correct decision    Type II error
Reject H0       Type I error        Correct decision

The probability of a type I error, denoted by α, is called the level of significance. The value
of α may be 1%, 2%, 5%, 10%, etc. The probability of a type II error is denoted by β, and 1 − β
is called the power of the test.
3. Determination of a suitable test statistic:
The third step is to determine a suitable test statistic. The usual test statistics used in
hypothesis testing are the Z-statistic, t-statistic, F-statistic and χ²-statistic.
4. Determine the critical Region/Critical Value:
It is important to specify, before the sample is taken, which values of the test statistic will
lead to a rejection of H0 and which will lead to its acceptance. The former set of values is
called the critical region.

5. Doing computation/Determine calculated value:


The fifth step in testing hypothesis is to determine the calculated value of test statistic.
6. Making decision:
Finally we may draw statistical conclusions and the management may take decisions.
If the calculated value of the test statistic is greater than or equal to the critical value, the
null hypothesis will be rejected. If the calculated value is less than the critical value, the
null hypothesis will be accepted.
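The decision rule in step 6 can be sketched in code. Python is not part of these notes; the function name `decide` is illustrative, and the two-tailed Z-test at the 5% level (critical value 1.96, from the standard normal table) is assumed only for the example:

```python
def decide(test_statistic, critical_value):
    """Step 6: compare the calculated and critical values of the test statistic."""
    if abs(test_statistic) >= critical_value:
        return "reject H0"
    return "accept H0"

# Two-tailed Z-test at the 5% level of significance (critical value 1.96).
print(decide(2.50, 1.96))  # reject H0
print(decide(1.20, 1.96))  # accept H0
```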
Two-tailed test:
A two-tailed test is one where the hypothesis is rejected for values of the test statistic
falling into either tail of the sampling distribution of the test statistic.
One-tailed test:
A one-tailed test is one where the hypothesis is rejected only for values of the test statistic
falling into one tail of the sampling distribution of the test statistic.
Power of the test:
Let the probability of a type II error be β; then 1 − β is called the power of the test.
Critical value or significant value:
The value of test statistic which separates the critical region and the acceptance region is
called the critical value or significant value.
Z-test/Normal test/Large-sample test:
Let U be a statistic with expected value E(U) and standard deviation σ(U). If the population
standard deviation is known, or can be estimated from a large sample (n ≥ 30), the normal test or
Z-test is defined as

Z = [U − E(U)] / σ(U),  or  Z = [U − E(U)] / estimated σ(U).

Z is distributed as normal with mean 0 and variance 1. If the calculated value of Z is greater
than or equal to the critical value of Z at level of significance α, the null hypothesis is
rejected; otherwise the null hypothesis is accepted.

Uses of normal test or Z-test:


i) Test of significance for a single mean
ii) Test of significance of the difference of means
iii) Test of significance for a sample proportion
iv) Test of significance of the difference of proportions
v) Test of significance of a specified value of the population correlation coefficient
vi) Test of significance of the difference of correlation coefficients
(i) Test of significance for a single mean:
Null hypothesis, H0: μ = μ0 (μ0 is a specified value)
Alternative hypothesis: i. HA: μ ≠ μ0
ii. HA: μ > μ0
iii. HA: μ < μ0
The required test statistic is given by

Z = (x̄ − μ0) / (σ/√n)  when σ is known, or
Z = (x̄ − μ0) / (s/√n)  when σ is unknown but estimated from a large sample (n ≥ 30).
Problem 1: A sample of 900 members has a mean of 3.5 cms and s.d. of 2.60 cms. Could the sample
come from a large population with mean 3.25 cms and s.d. 2.60 cms?
Problem 2: The mean lifetime of a sample of 120 light tubes produced by a company is found to
be 1670 hours with standard deviation of 80 hours. Test the hypothesis that the mean lifetime of
the tubes produced by the company is 1650 hours.
Problem 3: A manufacturer of bolts claims that the mean length of bolts is 5.5 inch with a
standard deviation of 0.02 inch. A random sample of 42 bolts yields a mean of 5.71 inch. Do
these data provide sufficient evidence to indicate that the true mean length is equal to the mean
length claimed by the manufacturer?
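As a check, Problem 1 can be worked through numerically. The Python below is not part of the original notes; only the figures given in the problem are used, with 1.96 as the 5% two-tailed critical value:

```python
import math

# Problem 1: x̄ = 3.5, μ0 = 3.25, s = 2.60, n = 900 (large sample, so Z-test).
x_bar, mu0, s, n = 3.5, 3.25, 2.60, 900
z = (x_bar - mu0) / (s / math.sqrt(n))
print(round(z, 2))        # 2.88
print(abs(z) >= 1.96)     # True: reject H0 at the 5% level (two-tailed)
```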
(ii) Test of significance of the difference of means:
Null hypothesis, H0: μ1 = μ2
Alternative hypothesis: i. H1: μ1 ≠ μ2
ii. H1: μ1 > μ2
iii. H1: μ1 < μ2
To test the above null hypothesis the required test statistic is given by

Z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2),  when σ1 and σ2 are known, or
Z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2),  when σ1 and σ2 are not known but estimated from large
samples.
Problem 4:
You are given the following information relating to purchases of bulbs from two manufacturers A
and B:

Manufacturer    No. of bulbs bought    Mean life    S.D.
A               100                    2950 hrs.    100 hrs.
B               100                    2970 hrs.    90 hrs.

Is there a significant difference in the mean life of the two makes of bulbs?
Problem 5:
I.Q. tests on two groups of boys and girls gave the following results:
Girls: x̄ = 78, s.d. = 10, n = 40
Boys: x̄ = 85, s.d. = 12, n = 60
Is there a significant difference in the mean scores of boys and girls at the 5% level of
significance?
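Problem 4 can be worked through numerically as a check. The Python below is not part of the original notes; only the summary figures in the problem are used, with 1.96 as the 5% two-tailed critical value:

```python
import math

# Problem 4: bulbs from manufacturers A and B, n1 = n2 = 100 (large samples).
x1, s1, n1 = 2950, 100, 100
x2, s2, n2 = 2970, 90, 100
z = (x1 - x2) / math.sqrt(s1**2 / n1 + s2**2 / n2)
print(round(z, 2))        # -1.49
print(abs(z) >= 1.96)     # False: no significant difference at the 5% level
```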
(iii) Test of significance for a sample proportion:
Null hypothesis: H0: π = π0 (where π is the population proportion)
Alternative hypothesis: H1: π ≠ π0
To test the above null hypothesis the required test statistic is given by

Z = (P − π0) / √(π0(1 − π0)/n),  where P = x/n.
Problem 6: 20 people were attacked by a disease and only 18 survived. Will you reject the
hypothesis that the survival rate, if attacked by this disease, is 85%, in favour of the
hypothesis that it is more, at the 5% level of significance?
Problem 7: A random sample of 100 seeds was taken from a large consignment for examination
and 15 were found to be defective. Can we accept the supplier's claim that the proportion of bad
seeds in the consignment is 0.03?
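Problem 6 can be worked through numerically. The Python below is not part of the original notes; it mirrors the problem's figures (note that n = 20 is small for the normal approximation and is kept here only to match the text), with 1.645 as the 5% one-tailed critical value:

```python
import math

# Problem 6: x = 18 survivors out of n = 20, hypothesized proportion π0 = 0.85.
x, n, pi0 = 18, 20, 0.85
p = x / n
z = (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)
print(round(z, 2))        # 0.63
print(z >= 1.645)         # False: cannot reject H0 (one-tailed, 5% level)
```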
(iv) Test of significance of the difference of proportions:
Null hypothesis, H0: π1 = π2 (where π denotes a population proportion)
Alternative hypothesis, H1: π1 ≠ π2
To test the above null hypothesis the required test statistic is given by

Z = (P1 − P2) / √(pq(1/n1 + 1/n2)),

where p1 = x1/n1, p2 = x2/n2, p = (n1p1 + n2p2)/(n1 + n2) and q = 1 − p.
Problem 8: Before an increase in excise duty on tea, 800 persons out of a sample of 1000
persons were found to be tea drinkers. After the increase in duty, 800 people were tea drinkers
in a sample of 1300 people. State whether there is a significant decrease in the consumption of
tea after the increase in excise duty.
Problem 9: In a year there are 956 births in town A, of which 52.6% were males, while in towns A
and B combined this proportion in a total of 1406 births was 0.495. Is there any significant
difference in the proportion of male births in the two towns?

t-test/Small-sample test:
In the normal test, we assume that σ(U) is either known or can be estimated from a large
sample (n ≥ 30). We may face situations where the sample is not large enough and σ(U) is not
known. In such cases an estimate of σ(U) is obtained and the test statistic becomes

t = [U − E(U)] / estimated σ(U),

which is distributed as Student's t with ν degrees of freedom, where ν is less than n.
Applications of t-test:
i) Test of significance for a single mean
ii) Test of significance of the difference of means
iii) Test of significance for a sample correlation coefficient
iv) Test of significance of the difference of means from correlated populations
v) Test of significance of an observed regression coefficient
vi) Test of significance of the difference of regression coefficients
vii) Test of significance for an observed partial correlation coefficient
Test of a single population mean:
We want to test the null hypothesis H0: μ = μ0
Alternative hypothesis: i. H1: μ ≠ μ0
ii. H1: μ > μ0
iii. H1: μ < μ0
To test the above null hypothesis, the required test statistic will be

t = (x̄ − μ0) / (s/√n),

which is distributed as Student's t with ν = n − 1 degrees of freedom.
Problem 1:
The weekly wages of 10 workers taken at random from a factory are given below:
Wages (Tk.): 578, 572, 570, 568, 572, 578, 570, 572, 596, 584
Is it possible that the mean wage of all workers of this factory is Tk. 580?
Problem 2:
The following are the weights (in lbs) of a random sample of 10 employees working in the
shipping department of a wholesale grocery firm: 154, 154, 186, 243, 159, 174, 183, 163, 192, 281.
On the basis of these data, can we conclude at the 0.05 significance level that the firm's
shipping department employees have a mean weight of 160 lbs?
Problem 3:
The daily sales (in Tk.) of a shop in a market are given below:
500 645 2100 1950 1715 895 1230 795 899
1230 1315 865 980 810 1815 1900 1400 1150
1125 1050 950 1150 1925 890 1600 1625 1750
Test the hypothesis that the average sale of the shop is Tk. 1400 on the basis of the above sample.
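Problem 1 above (the ten weekly wages) can be verified numerically. The Python below is not part of the original notes; the 5% two-tailed critical value of t with ν = 9 d.f. (2.262, from the t-table) is used for the conclusion:

```python
import math
import statistics as st

# Problem 1: ten weekly wages, hypothesized mean μ0 = 580.
wages = [578, 572, 570, 568, 572, 578, 570, 572, 596, 584]
mu0 = 580
n = len(wages)
x_bar = st.mean(wages)      # 576
s = st.stdev(wages)         # sample s.d. (n - 1 divisor)
t = (x_bar - mu0) / (s / math.sqrt(n))
print(round(t, 2))          # -1.48
# |t| < 2.262 (critical t, ν = 9, 5% two-tailed), so a mean wage of
# Tk. 580 for all workers is plausible.
```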

Test of two population means:
We want to test the null hypothesis H0: μ1 = μ2
Alternative hypothesis: i. H1: μ1 ≠ μ2
ii. H1: μ1 > μ2
iii. H1: μ1 < μ2
To test the above null hypothesis, the required test statistic will be

t = (x̄1 − x̄2) / √(s²/n1 + s²/n2),

which is distributed as Student's t with ν = n1 + n2 − 2 degrees of freedom. Here,

s² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2).
Problem 4:
Samples of two different types of bulbs were tested for length of life, and the following data
were obtained:
Type I (Hours): 1200, 1350, 1400, 1320, 1420, 1410, 1100, 1300
Type II (Hours): 1200, 1400, 1420, 1230, 1430, 1250, 1300, 1500, 1240
Is there a significant difference in the mean length of life of the two types of bulbs?
Problem 5:
Two types of batteries are tested for their length of life and the following data are obtained:
Type A (Hours): 450, 500, 600, 900, 850, 750, 930, 750, 800, 750, 630, 450, 620
Type B (Hours): 360, 530, 420, 620, 720, 420, 560, 850, 650, 450, 610
Is there a significant difference in the two means?
Problem 6:
An equal opportunities committee is investigating whether, in comparable jobs, men and women
workers are paid identical wages. The following information is obtained on 15 males and
18 females:

          Mean (in Tk.)    S.D. (in Tk.)
Male      21530            5090
Female    20995            5060
Test whether there is any difference between the mean salaries of men and women.
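Problem 6 can be worked through from its summary statistics using the pooled-variance formula above. The Python below is not part of the original notes; the 5% two-tailed critical value of t with ν = 31 d.f. (about 2.04, from the t-table) is used for the conclusion:

```python
import math

# Problem 6: pooled two-sample t-test from summary statistics.
n1, x1, s1 = 15, 21530, 5090   # males
n2, x2, s2 = 18, 20995, 5060   # females
s2_pooled = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t = (x1 - x2) / math.sqrt(s2_pooled / n1 + s2_pooled / n2)
print(round(t, 2))   # 0.3
# |t| is far below the critical value (about 2.04 with ν = 31 d.f., 5% two-tailed),
# so the difference between mean salaries is not significant.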
Test of hypothesis of the correlation coefficient:
We want to test the null hypothesis H0: ρ = 0, where ρ = population correlation coefficient
Alternative hypothesis: i. H1: ρ ≠ 0
ii. H1: ρ > 0
iii. H1: ρ < 0
To test the above null hypothesis, the required test statistic is given by

t = r√(n − 2) / √(1 − r²),  where r is the sample correlation coefficient,

which is distributed as Student's t with (n − 2) d.f.

Problem 7:
In a study of the relationship between expenditure (X) and annual sales volume (Y), a sample of
10 firms yielded a coefficient of correlation r = 0.93. Can we conclude on the basis of these
data that X and Y are linearly correlated?
Problem 8:
The following figures relate to age and pressure of 10 women.
Age:             56   42   36   47   49   42   60   72   63   55
Blood pressure:  149  125  118  128  145  140  155  160  149  150
Can we conclude on the basis of above data that age and blood pressure are linearly correlated?
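Problem 7 can be checked numerically with the t-statistic for a correlation coefficient given above. The Python below is not part of the original notes; the 5% two-tailed critical value of t with ν = 8 d.f. (2.306, from the t-table) is used for the conclusion:

```python
import math

# Problem 7: r = 0.93 from a sample of n = 10 firms.
r, n = 0.93, 10
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(t, 2))   # 7.16
# t far exceeds the critical value (2.306 with ν = 8 d.f., 5% two-tailed),
# so X and Y can be taken as linearly correlated.
```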

χ²-test
Karl Pearson first used it in 1900. The χ²-test is one of the simplest and most widely used
non-parametric tests in statistical work. It makes no assumption about the population being
sampled. The quantity χ² describes the magnitude of the discrepancy between theory and
observation. If χ² is zero, the observed and expected frequencies completely coincide. The
greater the value of χ², the greater the discrepancy between observed and expected frequencies.
The formula for computing χ² is

χ² = Σ (O − E)² / E,  where O = observed frequency and E = expected frequency.
Degree of freedom:
The number of degrees of freedom is the number of observations that are free to vary after
certain restrictions have been imposed on the data.

Application of the χ²-test:
i) To test a single variance
ii) To test goodness of fit
iii) To test independence of attributes
iv) To test the significance of the equality of several variances
v) To test the significance of the equality of several proportions
vi) To test the significance of the equality of several correlation coefficients
Assumptions for the application of the χ²-test:
The following conditions must be met in order for chi-square analysis to be applied.
i. The experimental data (sample observation) must be independent.
ii. The sample data must be drawn at random from the target population.
iii. The data should be expressed in original units for convenience of comparison, not in
percentage or ratio form.
iv. The sample should contain at least 50 observations.
v. There should not be less than five observations in any one cell (each data entry is
known as a cell).
vi. The constraint on the cell frequencies is ΣO = ΣE.
Single variance test:
Let us suppose that we have a random sample of size n consisting of x1, x2, x3, …, xn drawn
from a normal population. We want to test the null hypothesis that the population variance
is σ0².
Null hypothesis: H0: σ² = σ0², where σ0² is a specified value of σ².
Alternative hypothesis: i. H1: σ² ≠ σ0²
ii. H1: σ² > σ0²
iii. H1: σ² < σ0²
To test the above null hypothesis, the required test statistic is given by

χ² = (n − 1)s² / σ0²,

which is distributed as χ² with (n − 1) degrees of freedom.
If χ² ≤ χ²(1 − α/2) or χ² ≥ χ²(α/2), we reject the null hypothesis; otherwise we accept the null
hypothesis.
Problem 1:
Weights in kilograms of 10 shipments are given below: 38,40,45,53,47,43,55,48,52,49.
Can we say that the variance of the distribution of weights of all shipments, from which the
above sample of 10 shipments was drawn, is equal to 20 square kilograms?
Problem 2:
Prices of shares of a company on the different days in a month were found to be
66,65,69,70,69,71,70,63,64 and 68. Test whether the variance of the shares in the month is 9.
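Problem 2 can be worked through numerically with the statistic above. The Python below is not part of the original notes; the 5% two-tailed χ² limits with ν = 9 d.f. (2.700 and 19.023, from the χ²-table) are used for the conclusion:

```python
import statistics as st

# Problem 2: H0: σ² = 9 for the ten share prices.
prices = [66, 65, 69, 70, 69, 71, 70, 63, 64, 68]
sigma0_sq = 9
n = len(prices)
s_sq = st.variance(prices)            # sample variance (n - 1 divisor)
chi_sq = (n - 1) * s_sq / sigma0_sq
print(round(chi_sq, 2))               # 7.83
# chi_sq lies between the 5% two-tailed limits 2.700 and 19.023 (ν = 9 d.f.),
# so H0 is accepted: a variance of 9 is plausible.
```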
Test for independence of attributes:
One of the most frequent uses of χ² is for testing the null hypothesis that two criteria of
classification are independent. They are independent if the distribution of one criterion in no
way depends on the distribution of the other. If they are not independent, there is an
association between the criteria.
Let us designate the two attributes as A and B, where attribute A is assumed to have r
categories and attribute B is assumed to have c categories. Furthermore, assume the total
number of observations in the problem is N.
  A \ B    B1    B2    ...   Bj    ...   Bc    | Total
  A1       O11   O12   ...   O1j   ...   O1c   | R1
  A2       O21   O22   ...   O2j   ...   O2c   | R2
  ...
  Ai       Oi1   Oi2   ...   Oij   ...   Oic   | Ri
  ...
  Ar       Or1   Or2   ...   Orj   ...   Orc   | Rr
  Total    C1    C2    ...   Cj    ...   Cc    | ΣRi = ΣCj = N

We want to test the null hypothesis, H0: A and B are independent.
Alternative hypothesis, H1: A and B are not independent.
To test the above null hypothesis, the required test statistic is given by
χ² = Σi Σj (Oij² / Eij) − N,
where Oij = observed frequency in the ith row and jth column,
Eij = expected frequency = (Ri × Cj) / N, Ri being the ith row total and Cj the jth column total.
The statistic follows the χ² distribution with (r − 1)(c − 1) degrees of freedom.
If the calculated value of χ² is greater than or equal to the critical value, then the null hypothesis will be rejected; otherwise it is accepted.
Contingency Table:
A contingency table, sometimes called a two-way frequency table, cross tabulation, or
crosstab, is a tabular mechanism with at least two rows and two columns used in statistics to
present categorical data in terms of frequency counts. More precisely, a contingency table
shows the observed frequencies of two categorical variables, arranged
into r rows and c columns. The intersection of a row and a column of a contingency table is
called a cell. Problem 1 below is an example of a 2×2 contingency table.
Problem 1: A sample of 200 people with a particular disease was selected. Out of these, 100
were given a drug and the others were not given any drug. The results are as follows:
Drug No drug Total
Cured 65 55 120
Not cured 35 45 80
Total 100 100 200
Test whether the drug is effective or not.
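The test for Problem 1 can be sketched in Python (a minimal illustration added here, not part of the original notes; it uses the equivalent form χ² = Σ(O − E)²/E with E = Ri·Cj/N):

```python
# Chi-square test of independence for the 2x2 drug table.
observed = [[65, 55],   # cured:     drug, no drug
            [35, 45]]   # not cured: drug, no drug

row_totals = [sum(row) for row in observed]          # R_i
col_totals = [sum(col) for col in zip(*observed)]    # C_j
n = sum(row_totals)                                  # N = 200

chi_sq = sum(
    (o - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)            # E_ij = R_i * C_j / N
    for i, row in enumerate(observed)
    for j, o in enumerate(row)
)
# chi_sq = 25/12 = 2.083 with (2-1)(2-1) = 1 d.f.; below the 5% table
# value 3.841, so independence is not rejected (the drug is not shown
# to be significantly effective)
```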
Problem 2:
The following contingency table shows the classification of 2000 workers in a factory, according
to the disciplinary action taken by the management and promotional experience:
                         Promotional experience
Disciplinary action     Promoted    Not promoted
Not-offenders             146           462
Offenders                  54          1338

Test whether the disciplinary action taken and promotional experience are independent.
Problem 3:
Based on information on 1100 randomly selected fields about the tenancy status of the
cultivators of these fields and use of fertilizers collected in an agro economic enquiry, the
following classification was noted:
                       Owned   Rented
Using fertilizer        516     184
Not using fertilizer     64     336
Would you conclude that owner-cultivators are more inclined towards the use of fertilizers?
Problem 4:
Prices of a basket of goods and services showed the following trend in up-country and mid-town
markets:

                               Up-country trend
                           Increasing   Not increasing
Mid-town    Increasing         60            31
trend       Not increasing     15             5

Show whether the trends in up-country prices and in mid-town prices have any significant association.
F-test
The F-test is based on the F-distribution, which was named in honor of R. A. Fisher, who first
introduced it in 1924. The F-distribution is usually defined in terms of the ratio of the variances of
two normally distributed populations. Therefore the F-test statistic is defined as F = s1²/s2², which follows the
F-distribution with (n1 − 1) and (n2 − 1) degrees of freedom.
Applications of F-test:
(i) Test of significance for equality of two population variances
(ii) Test of significance for homogeneity of population means
(iii) Test of significance of an observed correlation ratio
(iv) Test of significance of linearity of regression
(v) Test of significance of an observed multiple correlation coefficient.
Testing the hypothesis for equality of two variances:
We want to test the null hypothesis, H0: σ1² = σ2²
Alternative hypothesis,
i. H1: σ1² ≠ σ2²
ii. H1: σ1² > σ2²
iii. H1: σ1² < σ2²
To test the above null hypothesis, the required test statistic is given by
F = s1²/s2² (when s1² > s2²), which follows the F-distribution with (n1 − 1) and (n2 − 1) degrees of
freedom, or F = s2²/s1² (when s2² > s1²), which follows the F-distribution with (n2 − 1) and (n1 − 1) degrees of
freedom.
Problem 1:
Two sources of raw materials are under consideration by a company. Both sources seem to
have similar characteristics, but the company is not sure about their respective uniformity. A
sample of 10 lots from source A yields a variance of 225, and a sample of 11 lots from source
B yields a variance of 200. Is it likely that the variance of source A is significantly greater than
the variance of source B?
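The calculation for Problem 1 can be sketched in Python (a minimal illustration added here, not part of the original notes; the critical value in the comment is the standard 5% F-table entry):

```python
# F-test for equality of two variances (Problem 1).
s_a_sq, n_a = 225, 10   # source A: sample variance, number of lots
s_b_sq, n_b = 200, 11   # source B
F = s_a_sq / s_b_sq     # larger sample variance placed in the numerator
df = (n_a - 1, n_b - 1) # (9, 10) degrees of freedom
# F = 1.125, well below F_0.05(9, 10) = 3.02, so the variance of source A
# is not significantly greater than that of source B
```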
Problem 2: A sample of the monthly earnings records of 15 employees of company A has a
variance of Tk. 15.90 while a similar sample of 27 employees for company B has a variance of
Tk. 17.50. Is it safe to assume that there is less variance in company A than in company B?
Problem 3:
Samples of two different types of bulbs were tested for length of life, and the following data
were obtained:
Type I (Hours): 1200, 1350, 1400, 1320, 1420, 1410, 1100, 1300
Type II (Hours): 1200, 1400, 1420, 1230, 1430, 1250, 1300, 1500, 1240
Is there a significant difference in the variation of the lifetimes of the two types of bulbs?
Problem 4:
Two types of batteries are tested for their length of life and the following data are obtained:
Type A (Hours): 450, 500, 600, 900, 850, 750, 930, 750, 800, 750, 630, 450, 620
Type B (Hours): 360, 530, 420, 620, 720, 420, 560, 850, 650, 450, 610
Is there a significant difference between the two variances of length of life?
Analysis of variance:
Analysis of variance is a technique of partitioning the total sum of squared deviations of all
sample values from the grand mean into components attributable to different factors;
hypotheses about these factors are then tested with F-tests.
One-way classification:
When data obtained from a population can be classified and arranged on the basis of one
factor only, the data are referred to as a one-way classification.
In one-way classification,
Total variation = Sum of squares due to treatment + Sum of squares due to error.
Null hypothesis, H0: μ1 = μ2 = ... = μk
Alternative hypothesis, H1: at least two of the means are not equal.
Analysis of variance table for one-way classification

Source of    Degree of    Sum of             Mean sum of            Calculated F
variation    freedom      square             square
Treatment    k − 1        SSTr               MSTr = SSTr/(k − 1)    F = MSTr/MSE
Error        n − k        SSE                MSE = SSE/(n − k)
Total        n − 1        SST = SSTr + SSE
Problem 4:
Miss Fatema, a supervisor, has 3 typists working under her supervision. She is concerned with the
time they spend on tea in addition to the normal lunch break. Her observations, recorded
in minutes for each typist, are as follows:
A 25 18 30 32 35 37 19
B 24 22 26 28 30 32 28 26
C 28 20 27 19 29 35 30 23 27 32
Can the difference in average time that the three typists spend on tea break be explained by
chance variation?
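The one-way ANOVA for this problem can be sketched in Python (a minimal illustration added here, not part of the original notes; the critical value in the comment is the standard 5% F-table entry):

```python
# One-way ANOVA for the three typists' tea-break times.
groups = {
    "A": [25, 18, 30, 32, 35, 37, 19],
    "B": [24, 22, 26, 28, 30, 32, 28, 26],
    "C": [28, 20, 27, 19, 29, 35, 30, 23, 27, 32],
}

all_obs = [x for g in groups.values() for x in g]
n, k = len(all_obs), len(groups)              # n = 25 observations, k = 3 groups
grand_mean = sum(all_obs) / n

# between-group (treatment) and within-group (error) sums of squares
ss_tr = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
ss_e = sum((x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g)

ms_tr = ss_tr / (k - 1)                        # mean square for treatment
ms_e = ss_e / (n - k)                          # mean square for error
F = ms_tr / ms_e
# F = 0.086, far below F_0.05(2, 22) = 3.44, so the differences among the
# typists' average tea-break times are explained by chance variation
```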
Two-way classification:
When data obtained from a population can be classified and arranged on the basis of two
factors, the data are referred to as a two-way classification.
In two-way classification,
Total variation = Sum of squares due to the first factor + Sum of squares due to the second factor + Sum of
squares due to error.
Null hypotheses, H0(1): the means of the levels of the first factor are all equal; H0(2): the means of the levels of the second factor are all equal.
Alternative hypotheses, H1(1) and H1(2): the corresponding means are not all equal.
Problem 5:
The following data represent the number of units of production per day turned out by 5 different
workers using 4 different types of machines:

               Machine type
Worker      A     B     C     D
  1        48    36    48    38
  2        48    40    50    44
  3        37    38    40    36
  4        43    34    45    32
  5        40    47    51    42
i) Test whether the mean productivity is the same for 4 different machine types
ii) Test whether the 5 workers differ with respect to mean productivity.
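The two-way ANOVA computation for Problem 5 can be sketched in Python (a minimal illustration added here, not part of the original notes; the critical values in the comment are standard 5% F-table entries):

```python
# Two-way ANOVA without replication: rows = workers, columns = machines A-D.
data = [
    [48, 36, 48, 38],
    [48, 40, 50, 44],
    [37, 38, 40, 36],
    [43, 34, 45, 32],
    [40, 47, 51, 42],
]
r, c = len(data), len(data[0])
total = sum(sum(row) for row in data)
cf = total ** 2 / (r * c)                                    # correction factor T^2/N

ss_total = sum(x ** 2 for row in data for x in row) - cf
ss_rows = sum(sum(row) ** 2 for row in data) / c - cf        # workers
ss_cols = sum(sum(col) ** 2 for col in zip(*data)) / r - cf  # machine types
ss_error = ss_total - ss_rows - ss_cols

ms_error = ss_error / ((r - 1) * (c - 1))
f_machines = (ss_cols / (c - 1)) / ms_error
f_workers = (ss_rows / (r - 1)) / ms_error
# f_machines = 5.87 > F_0.05(3, 12) = 3.49 and f_workers = 3.93 > F_0.05(4, 12) = 3.26,
# so both the machine types and the workers differ significantly in mean productivity
```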
Problem 6:
The following table gives the monthly sales (in thousand rupees) of a certain firm in three States by
its four salesmen:

                 Salesman
State       I     II    III    IV
  A        10      7     6      8
  B         8      9     6      5
  C        11      7     8      8

i. Test whether there is any significant difference among the firm's salesmen.
ii. Test whether there is any significant difference among the three States.
Problem 7:
The following data give the number of refrigerators sold by 4 salesmen in three months:

              Salesman
Month      A     B     C     D
January   52    43    48    39
June      45    48    50    45
December  41    45    41    48

Determine whether there is any difference in the average sales made by the four salesmen.
Correlation Matrix:
A correlation matrix is a table showing correlation coefficients between sets of variables.
Each random variable (Xi) in the table is correlated with each of the other values in the table (Xj).
This allows you to see which pairs have the highest correlation.

A correlation matrix showing correlation coefficients for combinations of 5 variables B1:B5.
The diagonal of the table is always a set of ones, because the correlation between a variable and
itself is always 1. You could fill in the upper-right triangle, but its entries would repeat those of the
lower-left triangle (because B1:B2 is the same as B2:B1); in other words, a correlation matrix is
a symmetric matrix.
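A correlation matrix can be built directly from the definition of the correlation coefficient, as in this Python sketch (the three variables x1, x2, x3 and their values are invented for illustration, not taken from the notes):

```python
import statistics

# Hypothetical sample of three variables.
data = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],    # perfectly correlated with x1
    "x3": [5, 3, 4, 2, 1],     # negatively correlated with x1
}

def corr(a, b):
    """Pearson correlation coefficient between two equal-length samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sum((x - ma) ** 2 for x in a)
                  * sum((y - mb) ** 2 for y in b)) ** 0.5

names = list(data)
matrix = [[corr(data[u], data[v]) for v in names] for u in names]
# the diagonal is 1.0, and matrix[i][j] == matrix[j][i] (symmetric)
```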
Multivariate Regression Analysis
As an example, in a sample of 50 individuals we measured: Y = toluene personal
exposure concentration (a widespread aromatic hydrocarbon); X1 = hours spent outdoors;
X2 = wind speed (m/sec); X3 = toluene home levels. Y is the continuous response
variable ("dependent"), while X1, X2, ..., Xp are the predictor variables ("independent").
Usually the questions of interest are: how can Y be predicted on the basis of the X's, and what is
the "independent" influence of wind speed, i.e. corrected for home levels and other
related variables? These questions can in principle be answered by multiple linear
regression analysis.
In the multiple linear regression model, Y has a normal distribution with mean
μY = β0 + β1X1 + ... + βpXp and standard deviation σ.
The model parameters β0, β1, ..., βp and σ must be estimated from the data, where
β0 = intercept
β1, ..., βp = regression coefficients
σ = σres = residual standard deviation
Interpretation of regression coefficients
In the equation Y = β0 + β1X1 + ... + βpXp,
βi equals the mean increase in Y per unit increase in Xi, while the other X's are kept
fixed. In other words, βi is the influence of Xi corrected (adjusted) for the other X's. The
estimation method follows the least squares criterion.
If b0, b1, ..., bp are the estimates of β0, β1, ..., βp, then the "fitted" value of Y is
Yfit = b0 + b1X1 + ... + bpXp.
The b0, b1, ..., bp are computed such that the sum of squared residuals Σ(Y − Yfit)² is minimal;
since Y − Yfit is called the residual, one can also say that the sum of squared residuals is minimized.
In our example, a statistical package gives the estimated regression
coefficients (bi) and standard errors (se) for toluene personal exposure levels.
Problem:
The Internal Revenue Service wants to estimate actual unpaid taxes discovered (Y) based on field
audit labor hours (X1) and computer hours (X2) during the last 20 months. The fitted linear
regression plane, SS(total), and SS(regression) are given below:
Ŷ = −10.828 + 0.684 X1 + 1.2 X2
SS(total) = 290.60 and SS(regression) = 280.12
i) Construct an analysis of variance table.
ii) Is regression significant? Will you go for t-test?
iii) Calculate percentage amount of variation explained by the plane and comment.
iv) Estimate actual unpaid taxes if field audit labor and computer hours 40 and 20
respectively.
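The arithmetic for parts (i), (iii), and (iv) can be sketched in Python (a minimal illustration added here, not part of the original notes; the critical value in the comment is the standard 5% F-table entry, and part (ii)'s t-tests would additionally require the coefficient standard errors, which are not given):

```python
# ANOVA table pieces for the fitted regression plane.
ss_total, ss_reg = 290.60, 280.12
n, p = 20, 2                        # observations, predictors
ss_error = ss_total - ss_reg        # residual sum of squares = 10.48

ms_reg = ss_reg / p                 # regression mean square, p d.f.
ms_error = ss_error / (n - p - 1)   # error mean square, n - p - 1 = 17 d.f.
F = ms_reg / ms_error
# F = 227.2 >> F_0.05(2, 17) = 3.59: the regression is highly significant

r_sq = ss_reg / ss_total            # coefficient of determination
# r_sq = 0.964: about 96.4% of the variation in Y is explained by the plane

y_hat = -10.828 + 0.684 * 40 + 1.2 * 20
# predicted unpaid taxes for X1 = 40, X2 = 20: about 40.53
```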