
Lecture Note CH 7 Ed Ok1

This document discusses processing and analyzing quantitative data. It identifies key considerations for preparing data for analysis, including editing data, handling blank responses, coding data, categorizing variables, entering data into software, and cleaning data. The document also discusses summarizing data using tables and graphs to facilitate interpretation. The overall goal of data processing is to organize raw data for statistical analysis and interpretation of results.

Uploaded by

tamirubeleten

CHAPTER SEVEN

PROCESSING, ANALYSING AND INTERPRETATION OF DATA
Chapter objectives

• Identify the main issues that you need to consider when preparing data for analysis;
• Distinguish between parametric and non-parametric tests;
• Illustrate how to apply statistical tools for quantitative data analysis;
• Illustrate the primary application of SPSS for data entry and organization.

7.1. Introduction
• After data have been collected, they must be organized and analysed using
various statistical tools and techniques
Flow diagram of data analysis and interpretation:

Data collection → Data processing → Data analysis → Interpretation of results → Discussion → Research question answered

• Data processing (getting data ready for analysis): editing data, handling blank spaces, coding data, categorizing data, creating data file, programming
• Data analysis: feel for data (mean, standard deviations, frequency distributions, correlations), goodness of data, hypotheses testing

7.2. Data processing
Editing data: the process of checking and adjusting the data for omissions, legibility, and consistency.
– It detects errors and omissions, corrects them when possible, and certifies that minimum data quality standards are achieved.
– The purpose of editing is to ensure that data are (1) accurate, (2) consistent with other information, (3) uniformly entered, (4) complete, and (5) arranged to simplify coding and tabulation.
– Editing is especially important for responses to open-ended questions in interviews and questionnaires, and for unstructured observations.
– It should be done on the same day that the data are collected so that the respondents may be contacted for any further information and clarification, as needed.
– Edits should be made in a different color of pencil or ink so that the original information is still available in case of doubts later on.
By Deribe A.
Data processing-cont’d

• Whenever possible, it is better to follow up with the respondent and get the correct data while editing.
• If editing is not done properly, the validity and reliability of the study could be impaired.
Handling blank responses:
• Not all respondents answer every item in the questionnaire.
• Answers may have been left blank because the respondent:
– Did not understand the question,
– Did not know the answer,
– Was unwilling to answer, or
– Was simply indifferent to responding to the entire questionnaire.

Data processing-cont’d
• If a substantial number of questions (say, 25 percent of the items in the questionnaire) have been left unanswered, it may be advisable to discard the questionnaire and not include it in the data set for analysis.
– It is important to mention in the final report the number of returned but unused responses due to excessive missing data.
• If, however, only two or three items are left blank in a questionnaire with, say, 30 or more items, a decision must be made about how these blank responses are to be handled. The options are:
– Assign the midpoint of the scale for an interval-scaled item.
– Ignore the blank responses when the analyses are done. This reduces the sample size whenever that variable is involved, but it is the best way of handling missing items when the sample size is large.
– Assign to the item the mean value of the responses of all those who have responded to that particular item.
• Treat "don't know" responses as missing items.

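The three ways of handling a few blank responses can be sketched in Python. This is only an illustration: the response values are hypothetical, blanks are represented as None, and the function names are invented for this sketch.

```python
from statistics import mean

# Hypothetical answers to one 7-point interval-scaled item; None marks a blank.
responses = [5, None, 6, 4, None, 7, 3]

def fill_with_midpoint(values, low=1, high=7):
    """Replace each blank with the midpoint of the scale."""
    midpoint = (low + high) / 2
    return [midpoint if v is None else v for v in values]

def drop_blanks(values):
    """Ignore blanks; the sample size for this item shrinks."""
    return [v for v in values if v is not None]

def fill_with_item_mean(values):
    """Replace each blank with the mean of those who answered the item."""
    item_mean = mean(drop_blanks(values))
    return [item_mean if v is None else v for v in values]

print(fill_with_midpoint(responses))
print(drop_blanks(responses))
print(fill_with_item_mean(responses))
```

Dropping blanks leaves five usable answers here instead of seven, which is why that option suits large samples best.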
Data processing- Cont’d
• The next step is to code the responses using numerical codes. Coding can be set up when the questionnaire is designed or after the data are collected.
• Coding: the process of identifying and classifying each answer with a numerical score or other character symbol.
• The responses to demographic variables can be coded as follows:

1. Age (years): [1] under 25; [2] 25-35; [3] 36-45; [4] 46-55; [5] over 55
2. Education: [1] high school; [2] diploma; [3] bachelor's degree; [4] master's degree; [5] doctoral degree; [6] other (specify)
3. Job level: [1] manager; [2] supervisor; [3] clerk; [4] secretary; [5] technician; [6] other (specify)
4. Sex: [1] M; [2] F
5. Work shift: [1] first shift; [2] second; [3] third
6. Employment status: [1] part time; [2] full time

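A coding scheme like this is essentially a lookup table. The sketch below shows one way it might be applied in Python; only two of the variables are included, and the dictionary layout and the function name are assumptions made for illustration.

```python
# Codebook for two of the demographic variables in the coding scheme.
CODEBOOK = {
    "sex": {"M": 1, "F": 2},
    "employment_status": {"part time": 1, "full time": 2},
}

def code_response(variable, answer):
    """Return the numerical code for an answer, or None if it is blank or unknown."""
    if answer is None:
        return None
    return CODEBOOK[variable].get(answer.strip())

print(code_response("sex", "F"))
print(code_response("employment_status", "part time"))
```

Returning None for blank or unrecognised answers keeps them visible as missing values instead of silently assigning a code.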
Data processing- Cont’d

Example: The following questions are used to measure Involvement and satisfaction variables
• To what extent would you agree with the following statements, on the scale of 1 to 7, 1 denoting very
low agreement, and 7 denoting very high agreement?
1 2 3 4 5 6 7
6. The major happiness of my life comes from my job
7. Time at work flies by quickly
8. I live, eat, and breathe my job
9. My work is fascinating
10. My work gives me a sense of accomplishment
11. My supervisor praises good work
12. The opportunities for advancement are very good here
13. My coworkers are very stimulating
14. People can live comfortably with their pay in this organization
15. I get a lot of cooperation at the workplace
16. My supervisor is not very capable
17. Most things in life are more important than work
18. Working here is a drag
19. The promotion policies here are very unfair
20. My pay is barely adequate to take care of my expenses
21. My work is not the most important part of my life
Data processing- Cont’d
• The purpose of coding responses from open-ended questions is to
reduce the large number of individual responses to a few general
categories of answers that can be assigned numerical score.

Note: The usual reason for using open-ended questions is that the researcher has no clear hypothesis regarding the answers.

Data processing- Cont’d
Categorization:
• It involves categorizing the variables such that the several items measuring a concept are all grouped together.
• Responses to negatively worded questions have to be reversed so that all answers are in the same direction.
– This can be done on the computer through a RECODE.
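On a 1 to 7 scale, reversing a score means mapping 1 to 7, 2 to 6, and so on. The Python sketch below mimics what a RECODE step does; the answers are hypothetical, and treating items 16 and 18 as the negatively worded ones is an assumption made only for this illustration.

```python
SCALE_MIN, SCALE_MAX = 1, 7

def reverse_score(score):
    """Reverse a score on the scale: 1 becomes 7, 2 becomes 6, and so on."""
    return SCALE_MIN + SCALE_MAX - score

# Hypothetical answers of one respondent, keyed by item number.
answers = {9: 6, 16: 2, 18: 1}

# Assumed set of negatively worded items (illustrative only).
NEGATIVE_ITEMS = {16, 18}

recoded = {
    item: reverse_score(score) if item in NEGATIVE_ITEMS else score
    for item, score in answers.items()
}
print(recoded)
```

After recoding, a high number means a favourable response on every item, so the items can be summed or averaged.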
Entering data:
• Raw data can be entered (manually or using a scanner sheet, a machine-readable form) through any software program. For instance, the SPSS Data Editor:
– Looks like a spreadsheet
– Can be used to enter, edit, and view the contents of the data file
– Represents each case as a row and each variable as a column
– Shows all missing values as a period (dot) in the cell
Data cleaning: check for wrongly coded variables, that is, make sure that all codes are legitimate. For example, if sex is coded 1 = male and 2 = female and a code of 3 is found, it is obvious that a mistake has been made that requires an adjustment.
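A legitimacy check of this kind is easy to automate. The following Python sketch assumes that sex is coded 1 = male and 2 = female; the records and the function name are hypothetical.

```python
# Legitimate codes per variable (assumed: 1 = male, 2 = female).
LEGAL_CODES = {"sex": {1, 2}}

def find_illegal_codes(records, variable):
    """Return (row_number, value) pairs whose code is not legitimate.

    Missing values (None) are skipped; they are handled separately.
    """
    legal = LEGAL_CODES[variable]
    return [
        (row, record[variable])
        for row, record in enumerate(records, start=1)
        if record[variable] is not None and record[variable] not in legal
    ]

# Hypothetical data file: the third case carries an impossible code.
records = [{"sex": 1}, {"sex": 2}, {"sex": 3}, {"sex": None}]
print(find_illegal_codes(records, "sex"))
```

The row numbers point back to the questionnaires that need to be rechecked.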
Data processing- Cont’d
Data summary code sheet

Respondent   Part I: Sex   Part I: Living together   Part I: Fulfilling needs   Part II: items 1 to 15           Total
1            0             0                         1                          4 5 5 5 4 4 4 5 4 4 3 4 5 4 3   63
2            1             10                        2                          2 2 3 2 1 1 2 2 2 2 3 2 2 2 3   31
228
229

7.3. Data Presentation
• Data in raw form are usually not easy to use for decision making.
• Some type of organization is needed:
– Table
– Graph
• Data presentation: the process of transforming a mass of raw data into tables and charts as a part of making sense of the data.
• It refers to the preparation of data in a form that can be used by a general audience.
• Tables:
– They can be used with just about all types of numerical data.
• Graphs:
– The type of graph to use depends on the variable being summarized.
Data presentation: The Frequency Distribution Table

Summarize data by category.

Example: Hospital Patients by Unit

Hospital Unit     Number of Patients
Cardiac Care      1,052
Emergency         2,245
Intensive Care    340
Maternity         552
Surgery           4,630

(Variables are categorical)
Data presentation-Cont’d

• Tables:
Example for tabulation of data: A small survey was carried out into the mode of travel to work by taking a
random sample of 20 employed adults.

Person Mode of travel Person Mode of travel


1 car 11 car
2 car 12 bus
3 bus 13 walk
4 car 14 car
5 walk 15 train
6 cycle 16 bus
7 car 17 car
8 cycle 18 cycle
9 bus 19 car
10 train 20 car

• How would you classify this data?
Data presentation-Cont’d
• These data are categorical (nominal), since mode of travel does not have a numerical value. This information is better displayed as a frequency table:
Table 7.1: Frequency of mode of travel
Mode of travel Frequency Relative frequency (%)
Car 9 45
Bus 4 20
Cycle 3 15
Walk 2 10
Train 2 10
Total 20 100

• Frequency: the number of times each category appears.
• Ordering the categories by descending frequency makes comparison clearer.
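The tabulation in Table 7.1 can be reproduced from the 20 raw answers with a few lines of Python. This is a minimal sketch using the standard library's Counter; the variable names are chosen for illustration.

```python
from collections import Counter

# Mode of travel for persons 1 to 20, in the order listed in the survey table.
modes = [
    "car", "car", "bus", "car", "walk", "cycle", "car", "cycle", "bus", "train",
    "car", "bus", "walk", "car", "train", "bus", "car", "cycle", "car", "car",
]

counts = Counter(modes)
n = len(modes)

# most_common() orders categories by descending frequency.
table = [(mode, freq, 100 * freq / n) for mode, freq in counts.most_common()]

for mode, freq, percent in table:
    print(f"{mode:<6} {freq:>3} {percent:>6.1f}")
print(f"{'Total':<6} {n:>3} {100.0:>6.1f}")
```

Each relative frequency is the category count divided by the sample size, so the percentages add up to 100.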

Graphical Presentation of Data (continued)

The choice of presentation depends on the type of variable:

• Categorical variables: frequency distribution, bar chart, pie chart
• Numerical variables: line chart, frequency distribution, histogram, scatter plot

Tables and Graphs for Categorical Variables

• Tabulating categorical data: frequency distribution table
• Graphing categorical data: bar chart, pie chart

Data presentation-Cont’d
• Diagrammatic representation of data (bar charts, pie charts, histograms, line graphs, frequency polygons)
• Bar charts:
– A bar chart is a graph that shows the frequency distribution of a variable.
– Bar charts can be used with nominal and with discrete data.
– Bars should be of equal width, with the height of each bar proportional to the frequency or the amount for that category.
– A vertical bar is drawn for each category.
– There is a gap between adjacent bars.
Types of bar charts: simple bar chart, multiple bar chart, component bar chart
– A simple bar chart shows the total for each category.
– A multiple bar chart is used when you are interested in changes in the components but the totals are of no interest.
– A component bar chart helps to compare totals and to see how the totals are made up.

Simple Bar Chart Example

Hospital Unit     Number of Patients
Cardiac Care      1,052
Emergency         2,245
Intensive Care    340
Maternity         552
Surgery           4,630

[Figure: simple bar chart "Hospital Patients by Unit". Vertical axis: number of patients per year (0 to 5000). Horizontal axis: hospital unit.]
A Multiple Bar Chart Example
• Sales by quarter for three sales territories:

         1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East     20.4      27.4      59        20.4
West     30.6      38.6      34.6      31.6
North    45.9      46.9      45        43.9

[Figure: multiple bar chart with separate bars for East, West and North in each quarter. Vertical axis: sales (0 to 60). Horizontal axis: quarter.]
A Component Bar Chart Example
• Sales by quarter for three sales territories:

         1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East     20.4      27.4      59        20.4
West     30.6      38.6      34.6      31.6
North    45.9      46.9      45        43.9

[Figure: component (stacked) bar chart with East, West and North stacked in one bar per quarter. Vertical axis: sales (0 to 160). Horizontal axis: quarter.]
Data Presentation
Pie chart:
• Presents data as segments of the whole pie.
• Each category is represented by a segment of a circle.
• The segments are presented in terms of percentages
• The size of each segment reflects the frequency of that category
and can be represented as an angle.

Pie Chart Example

Hospital Unit     Number of Patients   % of Total
Cardiac Care      1,052                11.93
Emergency         2,245                25.46
Intensive Care    340                  3.86
Maternity         552                  6.26
Surgery           4,630                52.50

[Figure: pie chart "Hospital Patients by Unit" with segments Cardiac Care 12%, Emergency 25%, Intensive Care 4%, Maternity 6%, Surgery 53%. Percentages are rounded to the nearest percent.]
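The size of each pie segment follows directly from the category counts: the percentage is the count divided by the total, and the angle is the same share of 360 degrees. A short Python sketch using the hospital figures (the variable names are illustrative):

```python
# Number of patients per hospital unit, from the example table.
patients = {
    "Cardiac Care": 1052,
    "Emergency": 2245,
    "Intensive Care": 340,
    "Maternity": 552,
    "Surgery": 4630,
}

total = sum(patients.values())

# Each segment's share of the whole, as a percentage and as an angle in degrees.
segments = {
    unit: {"percent": 100 * count / total, "angle": 360 * count / total}
    for unit, count in patients.items()
}

for unit, seg in segments.items():
    print(f"{unit:<15} {seg['percent']:6.2f}% {seg['angle']:7.1f} degrees")
```

The angles add up to 360 degrees, just as the percentages add up to 100.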
Graphs to Describe Numerical Variables

• Numerical data can be summarized with frequency distributions and cumulative distributions, and graphed with a histogram.

Graphs for Time-Series Data
• A line chart (time-series plot) is used to show
the values of a variable over time

• Time is measured on the horizontal axis

• The variable of interest is measured on the vertical axis

Line Chart Example

[Figure: line chart "Magazine Subscriptions by Year". Vertical axis: thousands of subscribers (0 to 350). Horizontal axis: year (1990 to 2006).]
Data presentation-Cont’d
– A frequency distribution (or frequency table) is a set of data that records the number of times a particular value of a variable, or range of values of a variable, occurs.

– Example:

Amount Deposited (Rs.)   Frequency
Less than 50,000         6,700
50,000 – 100,000         1,240
Above 100,000            375
Total                    8,315
Histogram
• A graph of the data in a frequency distribution is called a
histogram
• The interval endpoints are shown on the horizontal axis
• the vertical axis is either frequency, relative frequency, or
percentage
• Bars of the appropriate heights are used to represent the
number of observations within each class
• Differences from bar chart:
– The horizontal axis is a continuous scale, just like a normal graph, so there should be no gaps between bars.
– It is the area of the bars that is being compared, not the heights.

Histogram Example

Interval              Frequency
10 but less than 20   3
20 but less than 30   6
30 but less than 40   5
40 but less than 50   4
50 but less than 60   2

[Figure: histogram "Daily High Temperature" with no gaps between bars. Horizontal axis: temperature in degrees. Vertical axis: frequency.]
Data presentation-Cont’d

Exercise for Excel application

Sales by department (millions of dollars)
2007 2008 2009
Clothing 1.7 1.4 1.4
Furniture 3.4 4.9 5.6
Electrical goods 0.2 0.4 0.5
Total 5.3 6.7 7.5

Create:
1. A pie chart for sales in 2009
2. A simple bar chart of total sales for each of the three years
3. A multiple bar chart for sales by year and department
4. A component bar chart
5. A percentage bar chart
6. A line graph for the total sales over the three years
Data presentation-Cont’d

Essentials for the presentation of tables and charts:

– Present enough information without "drowning" the reader in information overload. Include:
o A title;
o Information about the units being represented in the columns of the table or the axes of the chart (sometimes this is placed by the axes, sometimes by the bars or lines, sometimes in a legend);
o The source of the data, if they were originally produced elsewhere.
– Help the reader to interpret the table or chart through visual clues and appropriate presentation.
– Use an appropriate type of table or chart for the purpose at hand.
Note:
– The horizontal axis is the "x axis" and is used for the independent variable;
– The vertical axis is the "y axis" and is used for the dependent variable.
Data presentation-Cont’d

Conventions important in constructing visual presentations:

1. Provide a descriptive title for each table. This applies as well to charts, graphs, and figures.
2. Label variables and variable categories. All rows and columns should also be fully labeled.
3. The sources of the data should be indicated.
4. All terms open to interpretation should be defined in the footnote section below the table.

7.4. Data Analysis: Basic concepts
Some terminologies:
• Data analysis: the application of logic to understand and interpret the data that have been collected about a subject. It involves determining consistent patterns and summarizing the appropriate details revealed in the investigation.
– Statistical analysis may range from portraying a simple frequency distribution to very complex multivariate analysis, such as multiple regression.
• Univariate analysis: the examination of the distribution of cases on only one variable at a time. Example:

Awareness of HIV/AIDS   No. of respondents   Frequency
Aware                   60                   60
Unaware                 40                   40
Total                   100                  100
Data Analysis: Basic concepts
Some terminologies:
• Bivariate analysis: an analysis of the relationship between two variables. Example:

Awareness of HIV/AIDS   Men   Women   Total
Aware                   50    10      60
Unaware                 15    25      40
Total                   65    35      100

• Multivariate analysis: statistical techniques involving more than one variable at once.
– Cross-tabulation (also known as a contingency table): a table or set of tables showing the number or percent of observations (i.e., frequencies) at every combination of the levels of two or more variables. It is called a contingency table because values of the dependent variable are contingent on values of the independent variable.
Data Analysis: Basic concepts

• Tabulation: the orderly arrangement of data in a table or other summary format. Counting the number of responses to a question and putting them in a frequency distribution is a simple, or marginal, tabulation.
Example: Do you support the parliamentary system? (n = 450)

Responses   Frequency
Yes         330
No          120
Total       450

• Tallying: the tabulation process done by hand. Tabulation can also be done by computer.
• Cross-tabulation (also known as a contingency table): a technique for comparing two classification variables.
– Most statistical analysis software (e.g., SPSS) allows you to add totals and row and column percentages when designing your table.
Data analysis: Basic concepts-Cont’d
Example: "Do you support the proposition that men and women should be treated equally in all regards?" (n = 450)

               Yes   No    Row total
Men            150   75    225
Women          180   45    225
Column total   330   120   450

• The frequency counts for the question ‘-----’ are presented as column totals.
• The total numbers of men and women in the sample are presented as row totals.
• These row and column totals are often called marginals, because they appear in the table's margin.
• In the above table there are four cells, each representing a specific combination of the two variables. E.g., the cell representing women who said they do not support the proposition has a frequency count of 45.
• Any cross-tabulation table may be classified according to the number of rows and columns (R by C). The table above is referred to as a "2 x 2" table because it has two rows and two columns.
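The marginals and row percentages of this 2 x 2 table can be computed from the four cell counts. The Python sketch below is one possible way to do it; the nested-dictionary layout is an assumption made for illustration.

```python
# Cell counts from the 2 x 2 example: group (row) by answer (column).
counts = {
    "Men": {"Yes": 150, "No": 75},
    "Women": {"Yes": 180, "No": 45},
}
answers = ("Yes", "No")

# Marginals: totals that appear in the table's margins.
row_totals = {group: sum(cells.values()) for group, cells in counts.items()}
column_totals = {a: sum(counts[group][a] for group in counts) for a in answers}
grand_total = sum(row_totals.values())

# Row percentages: each cell as a share of its row total.
row_percent = {
    group: {a: 100 * n / row_totals[group] for a, n in cells.items()}
    for group, cells in counts.items()
}

print(row_totals, column_totals, grand_total)
print(row_percent)
```

Row percentages make the two groups directly comparable: 20 percent of the women answered "No", against one third of the men.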
Data analysis: Basic concepts-Cont’d
• Descriptive analysis refers to the transformation of raw data into a form that will make them easy to understand and interpret.
• It is used to summarize and describe the data on cases included in a study.
• The calculation of averages, frequency distributions, and percentage distributions is the most common form of summarizing data.

Type of measurement                  Type of descriptive analysis
Nominal, two categories              Frequency table, proportion (percentage), mode
Nominal, more than two categories    Frequency table, category proportion (percentage), mode
Ordinal                              Rank order, median + the above
Interval                             Arithmetic mean + the above
Ratio                                Index numbers, geometric mean + the above
Descriptive analysis-Cont’d
• Descriptive statistics provide numerical and graphic procedures to summarize a collection of data in a clear and understandable way.
• They are a way of summarizing the complexity of the data with a single number.
• This is the simplest method of analysis.
• It reports the responses either as percentages or as actual numbers.
• It is concerned with the development of certain indices from the raw data.
• The important statistical measures used to summarise survey research data are:
A. Measures of central tendency or statistical averages
B. Measures of dispersion
C. Measures of asymmetry (skewness)
D. Measures of relationship
Descriptive analysis-Cont’d
A. Measures of central tendency or statistical averages
– The purpose of measures of central tendency is to determine the average value in a set of values.
– Mean, median and mode are the most popular averages.
Mean
• Also known as the arithmetic average
• The most common measure of central tendency
• The average of all values in a set of data
• Calculated by adding all the values in the group and then dividing by the number of values
• Helps to summarise the essential features of the data and enables comparison
Median
• The value of the middle item of a series when it is arranged in ascending or descending order
• It divides the series into two halves
• It is a positional average
Mode
• The most frequently occurring value in a series (maximum frequency)
– The mode in a distribution is the item around which there is maximum concentration
Example: Measures of Central Tendency (Arithmetic Mean)
– The arithmetic mean is the average of all the values under consideration.

Branch   Revenue
1        50,000,000
2        150,000,000
3        40,000,000
4        60,000,000
Total    300,000,000

Arithmetic Mean = 300,000,000 / 4 = 75,000,000
Example: Measures of Central Tendency (Median)
– The median is the midpoint of the distribution of values under consideration.

Salesperson   Number of Sales Calls
1             4
2             3
3             2
4             5
5             3
6             3
7             1
8             5

Ordered values: 1 2 3 3 3 4 5 5
Median = 3
Example: Measures of Central Tendency (Mode)
– The mode is the value that occurs most frequently in the distribution of values under consideration.

Salesperson   Number of Sales Calls
1             4
2             3
3             2
4             5
5             3
6             3
7             1
8             5

Mode = 3
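The three averages can be checked on the sales-calls data with Python's standard statistics module. This is just a quick sketch; the list name is chosen for illustration.

```python
from statistics import mean, median, mode

# Number of sales calls made by salespersons 1 to 8.
sales_calls = [4, 3, 2, 5, 3, 3, 1, 5]

print("Ordered:", sorted(sales_calls))
print("Mean:", mean(sales_calls))      # sum of values divided by their count
print("Median:", median(sales_calls))  # middle of the ordered values
print("Mode:", mode(sales_calls))      # most frequently occurring value
```

With an even number of observations the median is the average of the two middle values, here 3 and 3. The mean (26 / 8 = 3.25) is slightly higher because it uses every value, not just the position.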
Descriptive analysis-Cont’d

B. Measures of dispersion
• An average can represent a series only as well as a single figure can; it cannot reveal the entire story of the phenomenon under study.
• Averages do not tell anything about the scatter of observations within the distribution.
• Measures of dispersion show the degree to which numerical data tend to spread around an average value (mean).
• Important measures of dispersion are:
• Range = highest value – lowest value
– It uses only the highest and the lowest values, hence it is the weakest measure of dispersion.
• Mean deviation
– First calculate the mean, then deduct the mean from each value in the group, take the absolute value of each difference, add them up, and divide the sum by the number of values.
• Variance
– First calculate the mean, then deduct the mean from each value in the group, square each difference, add the squares, and divide the sum by the number of values.
• Standard deviation
– The square root of the variance; the most reliable measure of the degree to which the data are spread around the mean.
Justifications for ‘Dispersion’ measures
• Averages are representative of a frequency distribution, but they fail to give a complete picture of the distribution. Suppose that we have the yields (Kg per plot) of two wheat varieties from 5 plots each. The distributions may be as follows:

Variety I    45 42 42 41 40
Variety II   54 48 42 33 30

• The mean yield of both varieties is about 42 Kg (42.0 Kg for Variety I and 41.4 Kg for Variety II), but we cannot say that the two varieties perform the same. Yields are more uniform in the first variety and more variable in the second. The first variety may be preferred since it is more consistent in yield performance.
• From the above example, it is obvious that a measure of central tendency alone is not sufficient to describe a frequency distribution.
• In addition, we need a measure of the scatter of the observations.
• The scatter or variation of observations around their average is called dispersion.
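The difference between the two wheat varieties shows up clearly once the dispersion measures are computed. The Python sketch below divides by the number of values, so it gives the population variance and standard deviation; the function name is illustrative.

```python
from statistics import mean, pstdev, pvariance

def dispersion(values):
    """Range, mean deviation, variance and standard deviation of a data set."""
    m = mean(values)
    return {
        "range": max(values) - min(values),
        "mean_deviation": sum(abs(v - m) for v in values) / len(values),
        "variance": pvariance(values),  # squared deviations divided by n
        "std_dev": pstdev(values),      # square root of the variance
    }

# Yields (Kg per plot) of the two wheat varieties.
variety_1 = [45, 42, 42, 41, 40]
variety_2 = [54, 48, 42, 33, 30]

d1 = dispersion(variety_1)
d2 = dispersion(variety_2)
print(mean(variety_1), d1)
print(mean(variety_2), d2)
```

Both varieties average roughly 42 Kg, yet the range is 5 for Variety I and 24 for Variety II, and the other measures differ by a similar order.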
Descriptive analysis-Cont’d
C. Measures of asymmetry (skewness): these measure the shape of the distribution.
• The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the center.

[Figure: "Symmetric Distribution". Horizontal axis: values 1 to 9. Vertical axis: frequency (0 to 10).]
Descriptive analysis: Measures of asymmetry (continued)
• The shape of the distribution is said to be skewed if the observations are not symmetrically distributed around the center.
• A positively skewed distribution (skewed to the right) has a tail that extends to the right in the direction of positive values.
• A negatively skewed distribution (skewed to the left) has a tail that extends to the left in the direction of negative values.

[Figures: "Positively Skewed Distribution" and "Negatively Skewed Distribution". Horizontal axis: values 1 to 9. Vertical axis: frequency.]
Descriptive analysis-cont’d
D. Measures of relationship:
• Need to determine whether there is a
relationship between variables

Correlation (cont.)
• Magnitude
• Direction
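Magnitude and direction can both be read from a single correlation coefficient. The slide does not name a specific coefficient, so the sketch below assumes Pearson's r, a common choice for interval or ratio data; the function name and the data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y rises with x in the first pair and falls in the second.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

The sign of r gives the direction of the relationship, and its absolute value (between 0 and 1) gives the magnitude.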

Data Analysis-Cont’d: Inferential analysis
• Inferential analysis:
– Provides procedures to draw inferences about a population from a sample.
– It is concerned with the various tests of significance for testing hypotheses, in order to determine with what validity the data can be said to indicate some conclusion or conclusions.
– It is concerned with the estimation of population values.
– It is mainly on the basis of inferential analysis that the task of interpretation is performed.
Examples:
• The demand for a new Product X, based on a sample taken in Region Y
• The general election result, based on a representative survey of voters in electoral district Z

• Used for drawing conclusions about the population from a sample:
– Estimation
• Estimate the true value of the parameter from a sample
– Hypothesis testing
• Determine if there is a difference in a parameter value for two groups.
Data Analysis: Inferential analysis
• Inferential analysis:
• Inferential statistics answer the question, "To what extent can these findings be GENERALIZED?" (Can we infer that they are probably true for the whole population, not just the sample?)
– Involves using obtained sample statistics to estimate the corresponding population parameters.
• They are measures of the SIGNIFICANCE of the relationship between two or more variables. Significance refers to the probability that the findings could be attributed to sampling error.
• Appropriate statistics depend on the LEVEL OF MEASUREMENT OF THE DEPENDENT VARIABLE (and of the independent variable):

Level of measurement   Statistics                             Type of test
Nominal & Ordinal      X² (Chi-square)                        Non-parametric test
Interval & Ratio       t-tests (2 groups), ANOVA (F-ratio)    Parametric test
Statistical significance
• Following the sampling-theory approach, we accept or reject a hypothesis on the basis of sampling information alone.
• Since any sample will almost surely vary somewhat from its population, we must judge whether these differences are statistically significant or insignificant.
– A difference has statistical significance if there is good reason to believe the difference does not represent random sampling fluctuations only.
• Tests of significance require an understanding of the logic of hypothesis testing: two kinds of hypotheses are used, null and alternative.
• The alternative hypothesis may be non-directional or directional, corresponding to a two-tailed or a one-tailed test.
• If the null hypothesis is rejected, an alternative hypothesis is accepted.
Statistical significance-Cont’d
• The null hypothesis is tested at a particular significance level.
– This relates to the area (or probability) in the tail of the distribution being used for the test.
– This area is called the critical region. If the test statistic lies in the critical region, you would infer that the result is unlikely to have occurred by chance, and you would then reject the null hypothesis.
– For example, if the 5% level of significance was used and the null hypothesis was rejected, you would say that H0 had been rejected at the 5% (or the .05) significance level, and the result was significant.
– The more significant a result, the more likely that it represents something genuine.

Inferential Analysis
Testing Hypothesis:
• Once a hypothesis is established, it is either rejected or not rejected (accepted) on the basis of the sample data collected.
• Since any sample will almost surely vary somewhat from its population, we must judge whether these differences are statistically significant or insignificant.
• A difference has statistical significance if there is good reason to believe the difference does not represent random sampling fluctuations only.
Statistical testing procedures:
1. State the null hypothesis: while the researcher is usually interested in testing a hypothesis of change or differences, the null hypothesis is always used for statistical testing purposes.

Inferential Analysis
Statistical testing procedures---Cont’d
2. Choose the statistical test: on the basis of the measurement scale used and other factors.
3. Select the desired level of significance (α): the choice of the level of significance should be made before we collect data. The most common level is .05, although .01 is also widely used.
4. Compute the calculated difference value: after the data are collected, use the formula for the appropriate significance test to obtain the calculated value.
5. Obtain the critical test value: after we compute the calculated t, z, X², or other measure, we must look up the critical value in the appropriate table for that distribution. The critical value is the criterion that separates the region of rejection from the region of acceptance of the null hypothesis.
Inferential Analysis
Statistical testing procedures---Cont’d
Note: There is a different X² distribution for each number of degrees of freedom (d.f.), defined as the number of categories in the classification minus one (k - 1).
• The degrees of freedom are the number of observations that can be varied without changing the restraints or assumptions associated with a numerical system.
• For instance, for four numbers whose sum must be 10, the first three can be chosen freely, but the fourth is not free to vary (4, 2, 1, X: sum = 10, so X = 3).
– With chi-square contingency tables of the two-sample or k-sample variety, we have both rows and columns in the cross-classification table. In that instance, d.f. is defined as rows minus 1 (r - 1) times columns minus 1 (c - 1).
6. Make the decision: For most tests, if the calculated value is larger than the critical value, we reject the null hypothesis and conclude that the alternative hypothesis is supported. If the critical value is larger, we conclude that we have failed to reject the null hypothesis.
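The degrees-of-freedom rules and the decision step can be sketched in Python. The critical values below are the upper-tail chi-square values at the .05 level as found in standard statistical tables, for 1 to 4 degrees of freedom only; the function names and the calculated values passed to them are hypothetical.

```python
# Chi-square critical values at the .05 significance level (standard tables).
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def df_one_way(categories):
    """Degrees of freedom for a one-way classification: k - 1."""
    return categories - 1

def df_contingency(rows, columns):
    """Degrees of freedom for a contingency table: (r - 1) * (c - 1)."""
    return (rows - 1) * (columns - 1)

def decide(calculated, critical):
    """Reject H0 only if the calculated value exceeds the critical value."""
    return "reject H0" if calculated > critical else "fail to reject H0"

df = df_contingency(2, 2)  # a 2 x 2 table has 1 degree of freedom
print(df, decide(4.0, CHI2_CRITICAL_05[df]))
print(df_one_way(4), decide(2.0, CHI2_CRITICAL_05[df_one_way(4)]))
```

In practice the critical value is looked up for the significance level chosen in step 3, not only for .05.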
Tests of significance
• There are two general classes of significance tests:

Parametric hypothesis testing
• Used when the data are interval- or ratio-scaled (gross national product, industry sales volume) and the sample size is large
• It assumes that the data in the study are drawn from populations with normal (bell-shaped) distributions and/or normal sampling distributions
• Examples: z-test, t-test

Non-parametric hypothesis testing
• Used when the data are either ordinal or nominal
• Examples: Chi-square test, Kolmogorov-Smirnov test

Tests of significance– Cont’d
• Some guidelines for selecting an appropriate statistical method:

Interval or ratio scale
– Business problem: e.g., compare actual and hypothetical values of average salary
– Statistical question to be asked: Is the sample mean significantly different from the hypothesized population mean?
– Possible test of statistical significance: Z-test (if sample is large); t-test (if sample is small)

Ordinal scale, e.g. 1
– Business problem: compare actual and expected evaluations
– Statistical question to be asked: Does the distribution of scores for a scale with the categories excellent, good, fair, and poor differ from the expected distribution?
– Possible test of statistical significance: Chi-square test

Ordinal scale, e.g. 2
– Business problem: determine ordered preferences for all brands in a product class
– Statistical question to be asked: Does a set of rank-orderings in a sample differ from an expected or hypothetical rank ordering?
– Possible test of statistical significance: Kolmogorov-Smirnov test

Nominal scale
– Possible test of statistical significance: Chi-square test

Non-parametric test-2-1
Chi-square (X2) test (the chi-square distribution is right-skewed):
• It is particularly useful in tests involving nominal data but can be used for higher scales.
• Typical are cases where persons, events, or objects are grouped in two or more nominal categories such as male and female, yes-no, favor-undecided-against, successful-unsuccessful, agree-disagree, or class “A, B, C, or D.”
• Focus: it tests for significant differences between the observed distribution of data among categories and the expected distribution based upon the null hypothesis.
• It tests the goodness of fit of the observed distribution with the expected distribution.
• It must be calculated with actual counts rather than percentages.
• The value of X2 is the measure that expresses the extent of this difference. The larger the divergence, the larger the X2 value.
Non-parametric test---Cont’d
• The formula by which the X2 test statistic is calculated is:

\[ X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]

In which:
• O_i = observed number of cases categorized in the ith category
• E_i = expected number of cases in the ith category under H0
• k = the number of categories
Non-parametric test---Cont’d---2-3
Example 1: The following table shows the responses to a survey
investigating awareness of HIV/Aids:

Awareness of HIV/Aids    No. of respondents    Frequency (%)
Aware                    60                    60
Unaware                  40                    40
Total                    100                   100

• This frequency distribution (a one-way frequency table, or one-dimensional table) from a sample of 100 suggests that the majority of the population (60 percent) is aware of HIV/Aids. Is the observed difference the result of chance variation, or is it statistically significant?
• X2 = 4
X2 calculation
Step 1: H0: the number of respondents aware of the disease will equal the number of
respondents unaware of it.
• This determines the expected distribution. Thus in a sample of 100, 50 people would be expected to respond yes (aware) and 50 would be expected to respond no (unaware).
Step 2: Statistical test: X2
Step 3: Level of significance = 0.05, d.f. = (2-1) = 1

Awareness of    Observed          Expected       Expected          (Oi - Ei)    (Oi - Ei)2 / Ei
HIV/Aids        frequency (Oi)    probability    frequency (Ei)
Aware           60                .5             50                10           100/50 = 2
Unaware         40                .5             50                -10          100/50 = 2
Total           100               1              100               0            4

• Compare the computed X2 with the critical X2 value: at the 0.05 level with d.f. = 1, the tabular X2 = 3.84.
• Decision: since the calculated X2 (4) is larger than the critical value (3.84), the null hypothesis is rejected.
• The difference between the aware (60 percent) and unaware (40 percent) groups is statistically significant: the majority of the population is aware of HIV/Aids.
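The steps of Example 1 can be sketched in a few lines of Python. This is only an illustration and not part of the original notes: the function name chi_square_statistic and the variable names are assumptions, and the critical value 3.84 is simply the tabulated value quoted in the example.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [60, 40]              # aware, unaware (counts from the survey table)
n = sum(observed)
expected = [n * 0.5, n * 0.5]    # H0: equal numbers aware and unaware

x2 = chi_square_statistic(observed, expected)
df = len(observed) - 1           # k - 1 categories
critical_value = 3.84            # tabulated X2 at the 0.05 level, d.f. = 1

print(f"X2 = {x2:.2f}, d.f. = {df}")
print("Reject H0" if x2 > critical_value else "Fail to reject H0")
```

The calculation uses the raw counts, not the percentages, in line with the requirement that X2 be computed from actual counts.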
X2 calculation-Cont’d
• Example 2: Let us further analyze the data by subgroups of respondents (sex,
educational level, income level, etc.). Consider the sex category.

Awareness of HIV/Aids    Men    Women    Total
Aware                    50     10       60
Unaware                  15     25       40
Total                    65     35       100

• This is a contingency table (cross-tabulation) of awareness of HIV/Aids by sex.
• The table suggests that most men are aware of the disease. From this simple analysis we might conclude that there is a difference in awareness of HIV/Aids between men and women (it might also be stated that awareness of HIV/Aids may be associated with the sex of respondents).
• Is the observed difference between men and women the result of chance variation due to random sampling? Or is the discrepancy more than sampling variation? This is the notion of statistical significance.
• Undertake the significance test on the R x C contingency table, where R = row and C = column.
X2 calculation-Cont’d
• Two Variables: awareness level and Gender/sex
• In managerial terms the research asks whether men and women
have different levels of awareness.
• Statistical question: “Is awareness level independent of the respondent’s sex?”
• Hypothesis testing procedures:
• Null hypothesis: There is no difference in the HIV/Aids awareness
level between men and women.
• On this basis, calculate the expected frequency for each cell using the formula Eij = (Ri x Cj)/n, that is, (total of the ith row x total of the jth column)/n
Where:
Ri = total observed frequency in the ith row
Cj = total observed frequency in the jth column
n = sample size
X2 calculation-Cont’d
• Here are the expected frequencies, shown in parentheses:

Awareness level    Men        Women      Total
Aware              50 (39)    10 (21)    60
Unaware            15 (26)    25 (14)    40
Total              65         35         100

• Critical value = 3.84 at the 0.05 probability level with 1 d.f.
• X2 = 22.16, thus the null hypothesis is rejected
• Note: proper use of the chi-square test requires that each expected cell frequency have a value of at least 5. If this sample size requirement is not met, the researcher may take a larger sample or combine (“collapse”) response categories.
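As a rough sketch of the calculation for the awareness-by-sex table, the expected frequencies and the X2 statistic can be reproduced in Python. The variable names are illustrative assumptions; the data and the formula (row total x column total / n) come from the example.

```python
observed = [[50, 10],    # aware:   men, women
            [15, 25]]    # unaware: men, women

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell under H0: E_ij = R_i * C_j / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# X2 = sum over all cells of (O - E)^2 / E
x2 = sum((o - e) ** 2 / e
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)

print("Expected frequencies:", expected)
print(f"X2 = {x2:.2f}, d.f. = {df}")
print("All expected frequencies >= 5:", all(e >= 5 for row in expected for e in row))
```

The last line checks the minimum expected cell frequency mentioned in the note; here the smallest expected value is 14, so the requirement is met.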

Exercise
• Suppose you have collected data from 103 ECSC students about information
technology facilities at the College. Here are the summarized observed
distributions from the questionnaire:

Information technology facilities at the College are:
                 Good    Reasonable    Poor    Row totals
Undergraduate    63      18            5       86
Postgraduate     6       4             7       17
Column totals    69      22            12      103

• Test the statistical significance of the difference between undergraduate and postgraduate students' opinions of the information technology facilities (at the 0.01 level of significance).
Feedback
Information technology facilities at the College are (expected frequencies in parentheses):
                 Good          Reasonable    Poor          Row totals
Undergraduate    63 (58)       18 (18)       5 (10)        86 (83.5%)
Postgraduate     6 (11)        4 (4)         7 (2)         17 (16.5%)
Column totals    69 (67.0%)    22 (21.4%)    12 (11.6%)    103

• Critical value (d.f. = 2, significance level = 0.01) = 9.210
• Calculated X2 = 18.33
• Decision: since the calculated X2 (18.33) exceeds the critical value (9.210), the null hypothesis is rejected. The association between the type of student and their opinion of the information technology facilities is extremely unlikely to be explained by chance alone.
• Further analysis by examining the cell values in relation to the row and column totals: Of the postgraduates, 41.2 percent (7 of 17) thought the information technology facilities were poor. This is high compared with the column totals, which indicate that only 11.6 percent of all students thought the information technology facilities were poor.
• Postgraduate students have a poorer opinion of information technology facilities than undergraduate students do.
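The "further analysis" step can be sketched as follows: compute each group's percentage split and compare it with the overall split, then check the expected cell frequencies against the minimum of 5 noted earlier. The variable names are illustrative assumptions; only the counts come from the exercise.

```python
categories = ["good", "reasonable", "poor"]
observed = {
    "Undergraduate": [63, 18, 5],
    "Postgraduate": [6, 4, 7],
}

col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(col_totals)

# Percentage of each group giving each answer, and the overall percentages
row_percent = {group: [100 * c / sum(counts) for c in counts]
               for group, counts in observed.items()}
overall_percent = [100 * c / n for c in col_totals]

# Cells whose expected frequency (row total x column total / n) is below 5
small_cells = []
for group, counts in observed.items():
    for category, col_total in zip(categories, col_totals):
        e = sum(counts) * col_total / n
        if e < 5:
            small_cells.append((group, category, round(e, 2)))

print(f"Postgraduates answering 'poor': {row_percent['Postgraduate'][2]:.1f}%")
print(f"All students answering 'poor':  {overall_percent[2]:.1f}%")
print("Cells with expected frequency below 5:", small_cells)
```

In this data two postgraduate cells have expected frequencies below 5, so by the rule given earlier the result should be read with some caution, or the categories collapsed.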
The test of association: X2 calculation-Cont’d
Example: The personnel manager of a company believes that monthly paid staff
take more time off work through sickness than staff who are paid weekly
(and do not belong to the company sickness scheme). To test this hypothesis,
the sickness records of 561 randomly selected employees who have been in
continuous employment for the past year were analyzed. The following table
was produced, which places employees into 3 categories according to how
many days they were off work through sickness during the past year. For
example, 95 monthly paid employees were off sick for less than 5 days.

Table 4.3 Number of days off sick by type of employee
Type of employee    Number of days off sick
                    Less than 5 days    5 to 10 days    More than 10 days
Monthly paid        95                  47              18
Weekly paid         143                 146             112

• Is there any association between type of employee and number of days off sick?
The test of association: X2 calculation-Cont’d
• The null and alternative hypotheses are:
H0: There is no association between type of employee and number of days off sick.
H1: There is an association between type of employee and number of days off sick.
• In order to calculate the X2 test statistic it is necessary to determine the expected
values for each category. To do this you first have to work out the row and column
totals, as shown in the contingency table:

Type of employee    Number of days off sick
                    Less than 5 days    5 to 10 days    More than 10 days    Total
Monthly paid        95                  47              18                   160
Weekly paid         143                 146             112                  401
Total               238                 193             130                  561

• You now need to apply some basic ideas of probability to the problem. If an employee were chosen at random, the probability that he or she was monthly paid would be 160/561 and the probability that he or she would have been off sick for less than 5 days is 238/561. Therefore, if the two attributes are independent (as H0 states), the multiplication rule for two probabilities gives the probability that the person is both monthly paid and in the ‘less than 5 days’ category as 160/561 X 238/561.
The test of association: X2 calculation-Cont’d
• Since there are 561 employees in total, the expected number of employees with both these
attributes is:
(160/561) X (238/561) X 561 = (160 X 238)/561 = 67.9
• This could be written as:
Expected value = (Row total x Column total)/Grand total
and is applicable to all cells of a contingency table. The rest of the expected values can now be worked out in the same way.

Calculation of the chi-square test statistic:
O      E        (O-E)    (O-E)2    (O-E)2/E
95     67.9     27.1     734.41    10.816
47     55.0     -8.0     64.00     1.164
18     37.1     -19.1    364.81    9.833
143    170.1    -27.1    734.41    4.318
146    138.0    8.0      64.00     0.464
112    92.9     19.1     364.81    3.927
Total                              30.522

• So the test statistic for this problem is 30.522. Then find the critical value from the X2 table, which depends on the degrees of freedom of the table.
The test of association: X2 calculation-Cont’d

• Degrees of freedom: (number of columns - 1) X (number of rows - 1) = (3 - 1) X (2 - 1) = 2
• The critical value for 2 degrees of freedom at the 5% significance level is 5.991, and at the 0.1% significance level it is 13.815. Therefore, since the test statistic is greater than 13.815, H0 can be rejected at the 0.1% significance level, and you could conclude that there does seem to be an association between staff category and the number of days off sick.
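The whole test of association for the sickness data can be wrapped in one small function, sketched below. The function name chi_square_independence is a hypothetical helper, not a library call, and the critical values are the tabulated ones quoted above. Because the code keeps the expected values unrounded, it gives a statistic of about 30.5, marginally different from the 30.522 obtained with expected values rounded to one decimal place.

```python
def chi_square_independence(table):
    """Return (X2, d.f., expected frequencies) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected value = (row total x column total) / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(table, expected)
             for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df, expected


observed = [[95, 47, 18],       # monthly paid
            [143, 146, 112]]    # weekly paid

x2, df, expected = chi_square_independence(observed)

# Tabulated critical values for 2 d.f., as quoted in the example
critical_values = {0.05: 5.991, 0.001: 13.815}
decisions = {alpha: x2 > cv for alpha, cv in critical_values.items()}

print(f"X2 = {x2:.3f}, d.f. = {df}")
for alpha, reject in decisions.items():
    print(f"alpha = {alpha}: {'reject H0' if reject else 'fail to reject H0'}")
```

The same function can be reused for any contingency table in this chapter by passing the observed counts row by row.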

The test of association: X2 calculation-Cont’d
• In Addis Ababa a survey was carried out of 171 radio listeners who were asked which
radio station they listened to most during an average week. A summary of their replies
is given in this table, together with their age range.

                           Age range
                           Less than 20    20 to 30    Over 30
Radio Fana                 22              16          50
FM Addis                   6               11          16
Ethiopia/National radio    35              3           12

a. Is there any evidence that there is an association between age and radio station?
b. By considering the contribution to the value of your test statistic from each cell and
the relative sizes of the observed and expected frequencies in each cell, indicate the
main source of the association, if any exists.
The test of association: X2 calculation-Cont’d
A. Chi-squared= 38.126, Reject H0;
B. More people under 20 listen to National radio than expected.
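Part (b) asks for the contribution of each cell, which can be sketched as below. The variable names are illustrative assumptions. Computed from the table as printed, with unrounded expected frequencies, the contributions sum to roughly 37.5, a little below the 38.126 quoted; the difference may come from rounding or from the original data, and the conclusion is the same either way.

```python
stations = ["Radio Fana", "FM Addis", "Ethiopia/National radio"]
age_ranges = ["Less than 20", "20 to 30", "Over 30"]
observed = [[22, 16, 50],
            [6, 11, 16],
            [35, 3, 12]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Contribution of each cell to X2: (O - E)^2 / E
contributions = {}
for i, station in enumerate(stations):
    for j, age in enumerate(age_ranges):
        e = row_totals[i] * col_totals[j] / n
        contributions[(station, age)] = (observed[i][j] - e) ** 2 / e

x2 = sum(contributions.values())
largest_cell = max(contributions, key=contributions.get)

# List the cells from largest to smallest contribution
for (station, age), value in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{station:25s} {age:13s} {value:6.2f}")
print(f"X2 = {x2:.2f}; largest contribution from {largest_cell}")
```

The largest contribution comes from listeners under 20 who choose the national station, where the observed count (35) is well above the expected count (about 18.4).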

The test of association: X2 calculation-Cont’d
In order to assess the effectiveness of a company training program, each employee was
appraised before and after the training. Based on a comparison of the two
appraisals, each of the 110 production staff was classified according to how well they
had benefited from the training. This classification ranged from ‘worse’, which means
they now perform worse than they did before, to ‘high’, which means they perform
much better than they did before the training. The results of this appraisal can be seen
in the following table, where you will notice that employees have been further
classified by age:

Age of employee    Level of improvement
                   Worse    None    Some    High    Total
Below 40           1        5       24      30      60
40+                4        5       31      10      50
Total              5        10      55      40      110

a. Is there any association between level of improvement at the job and age?
b. What is the 95% confidence interval for the percentage of employees who showed a
high level of improvement at their job?
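For part (b), one common approach is the normal-approximation interval for a proportion, p ± z x sqrt(p(1 - p)/n). The notes do not say which method is intended, so the sketch below is an assumption, as are the variable names and the use of z = 1.96 for 95% confidence.

```python
import math

high_improvement = 40    # employees classified as 'high' (from the table)
n = 110                  # total production staff appraised

p = high_improvement / n                      # sample proportion
standard_error = math.sqrt(p * (1 - p) / n)   # standard error of a proportion
z = 1.96                                      # z value for 95% confidence

lower = p - z * standard_error
upper = p + z * standard_error

print(f"Sample percentage: {100 * p:.1f}%")
print(f"95% confidence interval: {100 * lower:.1f}% to {100 * upper:.1f}%")
```

Under these assumptions the interval runs from roughly 27% to 45% of employees.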
The test of association: X2 calculation-Cont’d

Note: Multivariate analysis
• Example: To examine the impacts of family income, family size, and location on family food expenditures
• Technique: Multiple regression
7.5. Data Interpretation- 2-2
• Interpretation is the process by which you put your own
meaning on the data you have collected and analyzed, and
compare that meaning with those advanced by others.
• It refers to the process of giving meaning to the data and
spelling out the implications in relation to the research
questions and objectives.
• It involves tracing the findings back to the objectives, questions and problems
• It requires making good arguments (claims, reasons, and evidence)
• Example 1: Over the last ten years, the local temperature has increased by 10 degrees C (claim). This is because of extensive deforestation (reason). Evidence must then be presented to support both.
• Example 2: In Woreda Y, a rural area, students could not sit for the exam (claim) because of the peak agricultural period (reason). The responses show that only 25% of students sat for the exam (evidence).
7.6. The application of SPSS for data processing and analysis

Summary of data analysis
• Levels of quantitative analysis
  • Descriptive statistics: variable frequencies, averages, ranges
    o For nominal or ordinal data: proportions, percentages, ratios
    o For interval or ratio data:
      • Measures of central tendency: mean (the sum of values divided by the number of cases), median (the value of the middle case), mode (the most frequently occurring value)
      • Measures of dispersion: range (the difference between the highest and lowest values), standard deviation (the square root of the mean of the squared deviations from the mean)
  • Inferential statistics: assessing the significance of your data and results
  • Simple inter-relationships: cross-tabulation or correlation between two variables
  • Multivariate analysis: studying the linkages between more than two variables
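The descriptive measures listed in this summary can be illustrated with Python's standard statistics module. The data values below are hypothetical and chosen only to keep the arithmetic easy to follow. The population form pstdev is used because it matches the definition given here (the square root of the mean of the squared deviations from the mean).

```python
import statistics

values = [2, 4, 4, 5, 7, 8]    # hypothetical interval-scaled observations

mean = statistics.mean(values)               # sum of values / number of cases
median = statistics.median(values)           # value of the middle case
mode = statistics.mode(values)               # most frequently occurring value
value_range = max(values) - min(values)      # highest minus lowest value
std_dev = statistics.pstdev(values)          # population standard deviation

print(f"mean = {mean}, median = {median}, mode = {mode}")
print(f"range = {value_range}, standard deviation = {std_dev:.2f}")
```

If the sample form with n - 1 in the denominator is wanted instead, statistics.stdev can be substituted for pstdev.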
