Lecture Note CH 7 Ed Ok1
7.1. Introduction
• After data have been collected, they must be organized and analysed using
various statistical tools and techniques
Flow diagram of data analysis and interpretation: Data Collection → Data Analysis → Interpretation of results
Data processing-cont’d
• If a substantial number of questions (say, 25 percent of the items in the questionnaire) have been left unanswered, it may be advisable to discard the questionnaire and exclude it from the data set for analysis.
  It is important to mention in the final report the number of returned responses that were not used because of excessive missing data.
• If, however, only two or three items are left blank in a questionnaire with, say, 30 or more items, a decision must be made about how these blank responses are to be handled. The options are:
  Assign the midpoint of the scale for an interval-scaled item.
  Ignore the blank responses when the analyses are done. This reduces the sample size whenever that variable is involved, but it is the best way of handling missing items when the sample size is large.
  Assign to the item the mean value of the responses of all those who have responded to that particular item.
• Treat ‘don’t know’ responses as missing items.
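The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answers are hypothetical, the scale is assumed to run from 1 to 7, and the helper names (`blank_share`, `fill_with_midpoint`, `fill_with_item_mean`) are invented for this example.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 7   # assumption: 7-point interval scale
MAX_BLANK_SHARE = 0.25        # the "say, 25 percent" rule of thumb

# Hypothetical answers of four respondents to the same five items; None = blank.
answers = [
    [5, 6, None, 4, 7],         # 1 of 5 blank (20%): keep
    [3, 4, 2, 5, 4],
    [None, None, 6, None, 5],   # 3 of 5 blank (60%): discard
    [6, 5, 4, 6, 6],
]

def blank_share(row):
    """Share of items a respondent left unanswered."""
    return sum(value is None for value in row) / len(row)

# Drop questionnaires with too many blanks and count them for the final report.
usable = [row for row in answers if blank_share(row) < MAX_BLANK_SHARE]
unused_count = len(answers) - len(usable)

def fill_with_midpoint(row):
    """Option 1: replace each blank with the midpoint of the scale."""
    midpoint = (SCALE_MIN + SCALE_MAX) / 2
    return [midpoint if value is None else value for value in row]

def fill_with_item_mean(rows):
    """Option 3: replace each blank with the mean of the answers given to that item."""
    item_means = [mean(v for v in column if v is not None) for column in zip(*rows)]
    return [[item_means[i] if v is None else v for i, v in enumerate(row)]
            for row in rows]

print("returned but unused:", unused_count)
print("midpoint fill:", fill_with_midpoint(usable[0]))
print("item-mean fill:", fill_with_item_mean(usable)[0])
```

Option 2 (ignoring blanks) needs no recoding: the blank cells are simply skipped when a statistic is computed, as the item means are here.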
• The next step is to code the responses using numerical codes. Coding can be done at the time of designing the questionnaire or after the data collection.
• Coding: the process of identifying and classifying each answer with a numerical score or other character symbol.
• The responses to demographic variables can be coded as follows:
1. Age (years): [1] under 25; [2] 25-35; [3] 36-45; [4] 46-55; [5] over 55
2. Education: [1] high school; [2] diploma; [3] bachelor’s degree; [4] master’s degree; [5] doctoral degree; [6] other (specify)
3. Job level: [1] manager; [2] supervisor; [3] clerk; [4] secretary; [5] technician; [6] other (specify)
4. Sex: [1] M; [2] F
5. Work shift: [1] first shift; [2] second; [3] third
6. Employment status: [1] part time; [2] full time
Example: The following questions are used to measure Involvement and satisfaction variables
• To what extent would you agree with the following statements, on the scale of 1 to 7, 1 denoting very
low agreement, and 7 denoting very high agreement?
1 2 3 4 5 6 7
6. The major happiness of my life comes from my job
7. Time at work flies by quickly
8. I live, eat, and breathe my job
9. My work is fascinating
10. My work gives me a sense of accomplishment
11. My supervisor praises good work
12. The opportunities for advancement are very good here
13. My coworkers are very stimulating
14. People can live comfortably with their pay in this organization
15. I get a lot of cooperation at the workplace
16. My supervisor is not very capable
17. Most things in life are more important than work
18. Working here is a drag
19. The promotion policies here are very unfair
20. My pay is barely adequate to take care of my expenses
21. My work is not the most important part of my life
• The purpose of coding responses from open-ended questions is to
reduce the large number of individual responses to a few general
categories of answers that can be assigned numerical score.
Note: The usual reason for using open-ended questions is that the researcher has no clear hypothesis regarding the answers.
Categorization:
• It involves categorizing the variables such that the several items measuring a
concept are all grouped together.
• Responses to some of the negatively worded questions have to be reversed so
that all answers are in the same direction.
This can be done on the computer through a RECODE.
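A minimal Python sketch of such a recode, assuming the 1-to-7 agreement scale of the example questionnaire and assuming that items 16 to 21 are the negatively worded ones (that reading of the items, the function name `recode`, and the sample scores are illustrative assumptions, not part of the lecture):

```python
SCALE_MIN, SCALE_MAX = 1, 7                 # assumption: 7-point agreement scale
NEGATIVE_ITEMS = {16, 17, 18, 19, 20, 21}   # assumption: negatively worded items

def recode(item_number, score):
    """Reverse the score of a negatively worded item; leave other items unchanged."""
    if item_number in NEGATIVE_ITEMS:
        return SCALE_MIN + SCALE_MAX - score   # 1<->7, 2<->6, 3<->5, 4 stays 4
    return score

# Hypothetical raw scores of one respondent, keyed by item number.
raw = {9: 6, 16: 2, 18: 7}
recoded = {item: recode(item, score) for item, score in raw.items()}
print(recoded)  # {9: 6, 16: 6, 18: 1}
```

After the recode a high score means the same thing on every item, so the items measuring one concept can be added or averaged.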
Entering data:
• Raw data can be entered (manually or using a scanner sheet, a machine-readable form) through any software program. For instance, the SPSS Data Editor:
  It looks like a spreadsheet.
  It can be used to enter, edit, and view the contents of the data file.
  Each row of the editor represents a case, and each column represents a variable.
  All missing values appear as a period (dot) in the cell.
Data cleaning: check for wrongly coded variables, that is, make sure that all codes are legitimate. For example, if sex is coded 1 = male and 2 = female and a code of 3 is found, it is obvious that a mistake has been made that requires correction.
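The legitimacy check can be automated. The sketch below is illustrative only: the variable names, the code book `VALID_CODES`, and the three records are hypothetical.

```python
# Hypothetical code book: the set of legitimate codes for each variable.
VALID_CODES = {
    "sex": {1, 2},
    "work_shift": {1, 2, 3},
}

# Hypothetical data file; None stands for a missing value.
records = [
    {"id": 1, "sex": 1, "work_shift": 2},
    {"id": 2, "sex": 3, "work_shift": 1},     # 3 is not a legitimate sex code
    {"id": 3, "sex": 2, "work_shift": None},  # missing, not illegal
]

def find_illegal_codes(rows, valid_codes):
    """Return (respondent id, variable, value) for every code outside the code book."""
    problems = []
    for row in rows:
        for variable, allowed in valid_codes.items():
            value = row[variable]
            if value is not None and value not in allowed:
                problems.append((row["id"], variable, value))
    return problems

print(find_illegal_codes(records, VALID_CODES))  # [(2, 'sex', 3)]
```

Each flagged case is then checked against the original questionnaire and corrected.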
Data summary code sheet
Respondent   Part I                                    Part II (items 1 to 15)                  Total
             Sex   Living together   Fulfilling needs  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1            0     0                 1                 4 5 5 5 4 4 4 5 4 4  3  4  5  4  3       63
2            1     10                2                 2 2 3 2 1 1 2 2 2 2  3  2  2  2  3       31
...
228
229
7.3. Data Presentation
• Data in raw form are usually not easy to use for decision making.
• Some type of organization is needed:
  • Table
  • Graph
• Data presentation: the process of transforming a mass of raw data into tables and charts as a part of making sense of the data.
• It refers to the preparation of data in a form that can be used by a general audience.
• Tables: they can be used with just about all types of numerical data.
• Graphs: the type of graph to use depends on the variable being summarized.
Data presentation: The Frequency Distribution Table (variables are categorical)
• Tables:
Example for tabulation of data: A small survey was carried out into the mode of travel to work by taking a random sample of 20 employed adults.
• How would you classify this data?
• This data is categorical (nominal) since mode of travel does not have a numerical value. This information would be better displayed as a frequency table:
Table 7.1: Frequency of mode of travel
Mode of travel   Frequency   Relative frequency (%)
Car              9           45
Bus              4           20
Cycle            3           15
Walk             2           10
Train            2           10
Total            20          100
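A frequency table like this can be built by counting the raw answers. In the sketch below the raw sample is hypothetical: it is rebuilt from the counts in the table, so the order of the individual answers is an assumption.

```python
from collections import Counter

# Hypothetical raw sample rebuilt from the table's counts (original order unknown).
sample = ["Car"] * 9 + ["Bus"] * 4 + ["Cycle"] * 3 + ["Walk"] * 2 + ["Train"] * 2

counts = Counter(sample)        # frequency of each mode of travel
total = sum(counts.values())    # sample size

# One row per category: (mode, frequency, relative frequency in percent).
table = [(mode, n, 100 * n / total) for mode, n in counts.most_common()]

for mode, n, percent in table:
    print(f"{mode:<8}{n:>4}{percent:>8.0f}")
print(f"{'Total':<8}{total:>4}{100:>8}")
```

The relative frequency of a category is its frequency divided by the sample size, expressed as a percentage.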
Graphical Presentation of Data
Tables and Graphs for Categorical Variables
Categorical data can be summarized with a frequency distribution table, a bar chart, or a pie chart.
Data presentation-Cont’d
• Diagrammatic representation of data (bar charts, pie charts, histograms, line graphs, frequency polygons)
• Bar charts:
  A bar chart is a graph that shows the frequency distribution of a variable.
  Bar charts can be used with nominal and with discrete data.
  Bars should be of equal width, with the height of each bar proportional to the frequency or the amount for each separate category.
  For each category a vertical bar is drawn.
  There is a gap between adjacent bars.
  Types of bar charts: simple bar chart, multiple bar chart, component bar chart
  A simple bar chart shows the total of each category.
  A multiple bar chart is used when you are interested in changes in the components but the totals are of no interest.
  A component bar chart helps to compare totals and to see how the totals are made up.
Simple Bar Chart Example
Hospital Unit     Number of Patients
Cardiac Care      1,052
Emergency         2,245
Intensive Care    340
[Figure: bar chart "Hospital Patients by Unit"; horizontal axis: hospital unit (Cardiac Care, Emergency, Intensive Care, Maternity, Surgery); vertical axis: number of patients, 0 to 5,000]
A Multiple Bar Chart Example
• Sales by quarter for three sales territories:
         1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East     20.4      27.4      59        20.4
West     30.6      38.6      34.6      31.6
North    45.9      46.9      45        43.9
[Figure: multiple bar chart with one bar per territory (East, West, North) in each quarter; vertical axis: sales, 0 to 60]
A Component Bar Chart Example
• Sales by quarter for three sales territories:
         1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East     20.4      27.4      59        20.4
West     30.6      38.6      34.6      31.6
North    45.9      46.9      45        43.9
[Figure: component (stacked) bar chart with East, West, and North stacked in one bar per quarter; vertical axis: sales, 0 to 160]
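To see how a component bar chart is built from this table, the sketch below stacks the territories and reports where each segment starts and ends. The stacking order (East at the bottom, North at the top) and the function name `stack_segments` are assumptions made for illustration.

```python
quarters = ["1st Qtr", "2nd Qtr", "3rd Qtr", "4th Qtr"]
sales = {   # stacking order assumed: East at the bottom, North on top
    "East":  [20.4, 27.4, 59.0, 20.4],
    "West":  [30.6, 38.6, 34.6, 31.6],
    "North": [45.9, 46.9, 45.0, 43.9],
}

def stack_segments(series, n_bars):
    """For each component, return the (bottom, top) of its segment in every bar."""
    running = [0.0] * n_bars            # current height of each bar
    segments = {}
    for name, values in series.items():
        tops = [base + value for base, value in zip(running, values)]
        segments[name] = list(zip(running, tops))
        running = tops                  # the next component starts where this one ends
    return segments, running            # running now holds the bar totals

segments, totals = stack_segments(sales, len(quarters))
for quarter, total in zip(quarters, totals):
    print(quarter, round(total, 1))
```

The bar totals (the full height of each bar) are what a simple bar chart of total sales would show; the segments show how each total is made up.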
By Deribe A.
Data Presentation
Pie chart:
• Presents data as segments of the whole pie.
• Each category is represented by a segment of a circle.
• The segments are presented in terms of percentages
• The size of each segment reflects the frequency of that category
and can be represented as an angle.
Pie Chart Example
Hospital Unit     Number of Patients   % of Total
Cardiac Care      1,052                11.93
Emergency         2,245                25.46
Intensive Care    340                  3.86
Maternity         552                  6.26
Surgery           4,630                52.50
[Figure: pie chart "Hospital Patients by Unit": Cardiac Care 12%, Emergency 25%, Intensive Care 4%, Maternity 6%, Surgery 53% (percentages are rounded to the nearest percent)]
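The percentage and the angle of each pie segment follow directly from the counts. A short sketch using the hospital figures (the variable names are illustrative):

```python
patients = {
    "Cardiac Care": 1052,
    "Emergency": 2245,
    "Intensive Care": 340,
    "Maternity": 552,
    "Surgery": 4630,
}

total = sum(patients.values())

# Share of the whole in percent, and the matching angle out of 360 degrees.
shares = {unit: 100 * n / total for unit, n in patients.items()}
angles = {unit: 360 * n / total for unit, n in patients.items()}

for unit in patients:
    print(f"{unit:<15}{shares[unit]:7.2f}%{angles[unit]:8.1f} degrees")
```

Each segment's angle is its share of the total multiplied by 360, so the angles add up to a full circle.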
Graphs to Describe Numerical Variables
Numerical data can be shown with a histogram.
Graphs for Time-Series Data
• A line chart (time-series plot) is used to show
the values of a variable over time
Line Chart Example
[Figure: line chart; horizontal axis: year, 1990 to 2006; vertical axis: thousands of subscribers, 0 to 350]
Data presentation-Cont’d
A frequency distribution (or frequency table) is a set of data which records the number of times a particular value of a variable, or range of values of a variable, occurs.
Example:
Histogram Example
Interval               Frequency
10 but less than 20    3
20 but less than 30    6
30 but less than 40    5
40 but less than 50    4
50 but less than 60    2
[Figure: histogram "Daily High Temperature"; horizontal axis: temperature in degrees, 0 to 70; vertical axis: frequency; no gaps between bars]
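The class frequencies of a histogram come from sorting each observation into its interval. The sketch below assumes equal-width intervals of the form "a but less than b"; the temperature readings are hypothetical values chosen to reproduce the frequencies above, and `bin_counts` is an illustrative helper.

```python
# Hypothetical daily high temperatures chosen to match the frequency table.
temperatures = [12, 17, 13, 21, 24, 27, 26, 28, 27, 35,
                32, 38, 37, 30, 43, 41, 44, 46, 53, 58]

def bin_counts(values, start, width, n_bins):
    """Count the values in each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:       # values outside all intervals are ignored
            counts[index] += 1
    return counts

counts = bin_counts(temperatures, start=10, width=10, n_bins=5)
for i, count in enumerate(counts):
    low = 10 + 10 * i
    print(f"{low} but less than {low + 10}: {'#' * count} ({count})")
```

A reading of exactly 30 falls in "30 but less than 40", which is why the intervals share no boundary value and the bars touch without gaps.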
Data presentation-Cont’d
Create:
1. A pie chart for sales in 2009
2. A simple bar chart of total sales for each of the three years
3. A multiple bar chart for sales by year and department
4. A component bar chart
5. A percentage bar chart
6. A line graph for the total sales over the three years
7.4. Data Analysis: Basic concepts
Some terminologies:
• Data analysis is the application of logic to understand and interpret the data that have been collected about a subject. It involves determining consistent patterns and summarizing the appropriate details revealed in the investigation.
  Statistical analysis may range from portraying a simple frequency distribution to very complex multivariate analysis, such as multiple regression.
• Univariate analysis is the examination of the distribution of cases on only one variable at a time. Example:
Awareness of HIV/AIDS   No. of respondents   Frequency
Aware                   60                   60
Unaware                 40                   40
Total                   100                  100
• Bivariate analysis: an analysis of the relationship between two variables. Example:
Awareness of HIV/AIDS   Men   Women   Total
Aware                   50    10      60
Unaware                 15    25      40
Total                   65    35      100
1   50,000,000
2   150,000,000
3   40,000,000
4   60,000,000
Total = 300,000,000
Arithmetic Mean = 300,000,000 / 4 = 75,000,000
Example: Measures of Central Tendency (Median)
The Median is the midpoint of the distribution of values under consideration.
Example: Measures of Central Tendency (Mode)
The Mode is the value that occurs most frequently in the distribution of values under consideration.
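The three measures can be checked with Python's standard `statistics` module. The four values are those of the arithmetic mean example; the `ratings` list used for the mode is hypothetical, since the four values above are all different.

```python
from statistics import mean, median, mode

values = [50_000_000, 150_000_000, 40_000_000, 60_000_000]

print(mean(values))    # sum of the values divided by their number
print(median(values))  # with an even number of values: average of the two middle ones

# Hypothetical ratings on a 1-to-5 scale, used only to illustrate the mode.
ratings = [3, 5, 4, 5, 2, 5, 4]
print(mode(ratings))   # the most frequent value
```

Here the mean (75,000,000) is pulled upward by the single large value, while the median (55,000,000) is not; this is why the median is often preferred for skewed data.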
Descriptive analysis-Cont’d
B. Measures of dispersion
• An average can represent a series only as well as a single figure can; it cannot reveal the entire story of the phenomenon under study.
• Measures of dispersion show the degree to which numerical data tend to spread around an average value (mean).
• Averages do not tell anything about the scatter of observations within the distribution.
• In order to measure the degree of scatter, statistical devices called measures of dispersion are calculated.
• Important measures of dispersion are:
• Range = highest value – lowest value
  – It shows only the difference between the highest value and the lowest value and ignores all values in between; hence it is the weakest measure of dispersion.
• Mean deviation
  – First calculate the mean, then subtract the mean from each value in the group, take the absolute value of each difference, add the absolute differences, and divide the sum by the number of values.
• Variance
  – First calculate the mean, then subtract the mean from each value in the group, square each difference, add the squares, and divide the sum by the number of values.
• Standard deviation
  – The square root of the variance; the most reliable measure of the degree to which the data are spread around the mean.
Justifications for ‘Dispersion’ measures
• Averages are representative of a frequency distribution, but they fail to give a complete picture of the distribution. Suppose that we have the distribution of the yields (Kg per plot) of two wheat varieties from 5 plots each. The distribution may be as follows:
Variety I    45  42  42  41  40
Variety II   54  48  42  33  30
• The mean yields are almost the same: 42 Kg for Variety I and about 41.4 Kg for Variety II. But we cannot say that the performance of the two varieties is the same. There is greater uniformity of yields in the first variety, whereas there is more variability in the yields of the second variety. The first variety may be preferred since it is more consistent in yield performance.
• From the above example, it is obvious that a measure of central tendency alone is not sufficient to describe a frequency distribution.
• In addition to it, we should have a measure of the scatter of observations.
• The scatter or variation of observations around their average is called the dispersion.
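The difference between the two varieties can be quantified with the measures of dispersion. The sketch below is one possible implementation: it divides by the number of values (the population form), takes absolute differences for the mean deviation, and uses an illustrative helper named `dispersion`.

```python
from math import sqrt

variety_1 = [45, 42, 42, 41, 40]
variety_2 = [54, 48, 42, 33, 30]

def dispersion(values):
    """Mean, range, mean deviation, variance and standard deviation of a list."""
    n = len(values)
    average = sum(values) / n
    deviations = [value - average for value in values]
    variance = sum(d * d for d in deviations) / n   # population form: divide by n
    return {
        "mean": average,
        "range": max(values) - min(values),
        "mean_deviation": sum(abs(d) for d in deviations) / n,
        "variance": variance,
        "std_dev": sqrt(variance),
    }

d1 = dispersion(variety_1)
d2 = dispersion(variety_2)
for name, result in (("Variety I", d1), ("Variety II", d2)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Every measure is several times larger for Variety II, which expresses numerically what the raw yields suggest: similar averages, very different consistency.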
Descriptive analysis-Cont’d
C. Measures of asymmetry (skewness)-it measures the shape of
distribution
• The shape of the distribution is said to be symmetric if the observations are
balanced, or evenly distributed, about the center.
[Figure: Symmetric Distribution; horizontal axis: values 1 to 9; vertical axis: frequency, 0 to 10]
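Descriptive analysis: Measures of asymmetry (continued)
• The shape of the distribution is said to be skewed if the observations are not symmetrically distributed around the center.
• A positively skewed distribution has a tail that extends to the right in the direction of positive values.
• A negatively skewed distribution has a tail that extends to the left in the direction of negative values.
[Figures: Positively Skewed Distribution and Negatively Skewed Distribution; horizontal axis: values 1 to 9; vertical axis: frequency]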
Descriptive analysis-cont’d
D. Measures of relationship:
• Need to determine whether there is a
relationship between variables
Correlation (cont.)
• Magnitude
• Direction
Data Analysis-Cont’d: Inferential analysis
• Inferential analysis:
Provides procedures to draw inferences about a population from a sample.
It is concerned with the various tests of significance for testing hypotheses in
order to determine with what validity data can be said to indicate some
conclusion or conclusions.
It is concerned with the estimation of population values.
It is mainly on the basis of inferential analysis that the task of interpretation is
performed.
Examples:
• The demand for a new Product X based on a sample survey conducted in Region Y
• The general election result based on a representative survey of voters in electoral district Z
[Figure: classification of significance tests into parametric tests, e.g. ANOVA (F-ratio), and non-parametric tests]
Statistical significance
• Following the sampling-theory approach, we accept or reject a
hypothesis on the basis of sampling information alone.
• Since any sample will almost surely vary somewhat from its
population, we must judge whether these differences are
statistically significant or insignificant.
A difference has statistical significance if there is good reason to
believe the difference does not represent random sampling
fluctuations only.
• Tests of significance require an understanding of the logic of hypothesis testing: two kinds of hypotheses are used, null and alternative.
• Alternative hypotheses correspond with two-tailed and one-tailed tests.
• If the null hypothesis is rejected, an alternative hypothesis is accepted.
Statistical significance-Cont’d
• The null hypothesis is tested at a particular significance level.
This relates to area (or probability) in the tail of the distribution
being used for the test.
This area is called the critical region, and if the test statistic lies in
the critical region, you would infer that the result is unlikely to
have occurred by chance. You would then reject the null
hypothesis.
For example, if the 5% level of significance was used and the null hypothesis was rejected, you would say that H0 had been rejected at the 5% (or the .05) significance level, and the result was significant.
The more significant a result, the more likely that it represents something genuine.
Inferential Analysis
Testing Hypothesis:
• Once a hypothesis is established, it is either rejected or not rejected (it fails to be rejected) on the basis of the sample data collected.
• Since any sample will almost surely vary somewhat from its population, we must judge whether these differences are statistically significant or insignificant.
• A difference has statistical significance if there is good reason to
believe the difference does not represent random sampling
fluctuations only.
Statistical testing procedures:
1. State the null hypothesis: while the researcher is usually
interested in testing a hypothesis of change or differences, the
null hypothesis is always used for statistical testing purposes.
2. Choose the statistical test: on the basis of measurement scale
used & other factors
3. Select the desired level of significance (α): The choice of the level
of significance should be made before we collect data. The most
common level is .05, although .01 is also widely used.
4. Compute the calculated difference value (the test statistic): After the data are collected, use the formula for the appropriate significance test to obtain the calculated value.
5. Obtain the critical test value: After we compute the calculated t, z, χ², or other measure, we must look up the critical value in the appropriate table for that distribution. The critical value is the criterion that separates the region of rejection from the region of acceptance of the null hypothesis.
Note: There is a different χ² distribution for each number of degrees of freedom (d.f.), defined as the number of categories in the classification minus one (k-1).
• The degrees of freedom are the number of observations that can be varied without changing the restraints or assumptions associated with a numerical system.
  For instance, for four numbers whose sum is 10, there is freedom of choice for the first three numbers, but the fourth value is not free to vary (4, 2, 1, X: sum = 10, so X must be 3).
  With chi-square contingency tables of the two-sample or k-sample variety, we have both rows and columns in the cross-classification table. In that instance, d.f. is defined as rows minus 1 (r-1) times columns minus 1 (c-1).
6. Make the decision: For most tests, if the calculated value is larger than the critical value, we reject the null hypothesis and conclude that the alternative hypothesis is supported. If the critical value is larger, we conclude that we have failed to reject the null.
Tests of significance
• There are two general classes of significance tests:
• Parametric tests
  Used when the data are interval- or ratio-scaled (gross national product, industry sales volume) and the sample size is large.
  They assume that the data in the study are drawn from populations with normal (bell-shaped) distributions and/or normal sampling distributions.
• Non-parametric tests
  Used when the data are either ordinal or nominal.
  Examples: Chi-square, Kolmogorov-Smirnov test
Tests of significance-Cont’d
• Some guidelines for selecting an appropriate statistical method:
Business problem: interval or ratio scale, e.g., compare actual and hypothetical values of average salary
Statistical question to be asked: Is the sample mean significantly different from the hypothesized population mean?
Possible tests of statistical significance: Z-test (if the sample is large); t-test (if the sample is small)
Non-parametric test
Chi-square (χ²) test (the chi-square distribution is right skewed):
  It is particularly useful in tests involving nominal data but can be used for higher scales.
  Typical are cases where persons, events, or objects are grouped in two or more nominal categories such as male and female, yes-no, favor-undecided-against, successful-unsuccessful, agree-disagree, or class “A, B, C, or D.”
  Focus: it tests for significant differences between the observed distribution of data among categories and the expected distribution based upon the null hypothesis.
  It tests the goodness of fit of the observed distribution with the expected distribution.
  It must be calculated with actual counts rather than percentages.
  The value of χ² is the measure that expresses the extent of this difference. The larger the divergence, the larger the χ² value.
Non-parametric test-Cont’d
• The formula by which the χ² test is calculated is:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
In which:
• $O_i$ = observed number of cases categorized in the ith category
• $E_i$ = expected number of cases in the ith category under H0
• $k$ = the number of categories
Example 1: The following table shows the responses to a survey investigating awareness of HIV/AIDS:
Awareness of HIV/AIDS   No. of respondents   Frequency
Aware                   60                   60
Unaware                 40                   40
Total                   100                  100
χ² calculation
Step 1: H0: the number of respondents aware of the disease will equal the number of respondents unaware of it.
This determines the expected distribution. Thus in a sample of 100, 50 people would be expected to respond yes (aware), and 50 would be expected to respond no (unaware).
Step 2: Statistical test: χ²
Step 3: Level of significance = 0.05, d.f. = (2-1) = 1
Awareness of HIV/AIDS   Observed frequency (Oi)   Expected probability   Expected frequency (Ei)   (Oi - Ei)   (Oi - Ei)²/Ei
Aware                   60                        .5                     50                        10          100/50 = 2
Unaware                 40                        .5                     50                        -10         100/50 = 2
Total                   100                       1                      100                       0           4
• Compare the computed χ² (4) with the critical χ² value associated with the 0.05 level and d.f. = 1; the tabular χ² = 3.84.
• Decision: since the calculated χ² is larger than the critical value, the null hypothesis is rejected.
  The majority of the population (60 percent) is aware of HIV/AIDS.
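The same goodness-of-fit calculation can be reproduced in a few lines. This is a sketch of the arithmetic only; the critical value 3.84 is the tabular figure quoted in the example, and the variable names are illustrative.

```python
observed = {"Aware": 60, "Unaware": 40}
n = sum(observed.values())

# Under H0 the two categories are equally likely, so each expects half the sample.
expected = {category: n * 0.5 for category in observed}

# Chi-square statistic: sum of (O - E)^2 / E over the categories.
chi_square = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)

degrees_of_freedom = len(observed) - 1   # k - 1
CRITICAL_05_DF1 = 3.84                   # tabular value quoted in the example

reject_h0 = chi_square > CRITICAL_05_DF1
print(chi_square, degrees_of_freedom, reject_h0)  # 4.0 1 True
```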
• Example 2: Let us further analyze the data by subgroups of respondents (sex, educational level, income level, etc.). Consider the breakdown by sex.
Awareness of HIV/AIDS   Men   Women   Total
Aware                   50    10      60
Unaware                 15    25      40
Total                   65    35      100
Exercise
• Suppose you have collected data from 103 ECSC students about information technology facilities at the College. Here are the summarized observed distributions from the questionnaire:
Information technology facilities at the College are:
                 Good   Reasonable   Poor   Row totals
Undergraduate    63     18           5      86
Postgraduate     6      4            7      17
Column totals    69     22           12     103
Feedback
Information technology facilities at the College are (expected frequencies in parentheses):
                 Good         Reasonable   Poor         Row totals
Undergraduate    63 (58)      18 (18)      5 (10)       86 (83.5%)
Postgraduate     6 (11)       4 (4)        7 (2)        17 (16.5%)
Column totals    69 (67.0%)   22 (21.4%)   12 (11.6%)   103
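The figures in parentheses appear to be expected frequencies rounded to whole numbers. Assuming the usual rule for a contingency table (row total times column total, divided by the grand total), they can be reproduced as follows; the variable names are illustrative.

```python
observed = [
    [63, 18, 5],   # Undergraduate: good, reasonable, poor
    [6, 4, 7],     # Postgraduate
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency of a cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]
rounded = [[round(e) for e in row] for row in expected]

for row in expected:
    print([round(e, 2) for e in row])
print(rounded)  # [[58, 18, 10], [11, 4, 2]]
```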
The test of association: χ² calculation
• The null and alternative hypotheses are:
H0: There is no association between type of employee and number of days off sick.
H1: There is an association between type of employee and number of days off sick.
• In order to calculate the χ² test statistic it is necessary to determine the expected values for each category. To do this you first have to work out the row and column totals as shown by the contingency table:
Type of employee   Number of days off sick
                   Less than 5 days   5 to 10 days   More than 10 days   Total
Monthly paid       95                 47             18                  160
Weekly paid        143                146            112                 401
Total              238                193            130                 561
• You now need to apply some basic ideas of probability to the problem. If an employee was chosen at random, the probability that he or she was monthly paid would be 160/561 and the probability that he or she would have been off sick for less than 5 days is 238/561. Therefore, using the multiplication rule for two independent events (independence is what H0 assumes), the probability that the person is both monthly paid and in the ‘less than 5 days’ category is 160/561 × 238/561.
• Since there are 561 employees in total, the expected number of employees with both these attributes is:
(160/561) × (238/561) × 561 = (160 × 238)/561 = 67.9
• This could be written as:
Expected value = (Row total × Column total)/Grand total
and is applicable to all cells of a contingency table. The rest of the expected values can now be worked out.
Calculation of the chi-square test statistic:
O     E       (O-E)    (O-E)²    (O-E)²/E
95    67.9    27.1     734.41    10.816
47    55.0    -8.0     64.00     1.164
18    37.1    -19.1    364.81    9.833
143   170.1   -27.1    734.41    4.318
146   138.0   8.0      64.00     0.464
112   92.9    19.1     364.81    3.927
Total                            30.522
• So the test statistic for this problem is 30.522. Then find the critical value from the χ² table, which depends on the degrees of freedom of the contingency table.
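The whole calculation, including the decision, can be sketched as below. Two things here are assumptions rather than part of the worked example: expected values are kept unrounded, so the statistic comes out close to, but not exactly, 30.522; and the critical value 5.991 is the standard tabular χ² for the 0.05 level with 2 degrees of freedom. The function name is illustrative.

```python
observed = [
    [95, 47, 18],      # Monthly paid
    [143, 146, 112],   # Weekly paid
]

def chi_square_test_of_association(table):
    """Return the chi-square statistic and degrees of freedom of a contingency table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * column_totals[j] / grand_total  # expected value
            statistic += (o - e) ** 2 / e
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)  # (r-1)(c-1)
    return statistic, degrees_of_freedom

statistic, df = chi_square_test_of_association(observed)
CRITICAL_05_DF2 = 5.991   # assumption: standard table value, 0.05 level, 2 d.f.

print(round(statistic, 2), df)
print("reject H0" if statistic > CRITICAL_05_DF2 else "fail to reject H0")
```

Because the statistic is far above the critical value, H0 is rejected: the data suggest an association between type of employee and number of days off sick.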
The test of association: χ² calculation-Cont’d
• In Addis Ababa a survey was carried out of 171 radio listeners who were asked what radio station they listened to most during an average week. A summary of their replies is given in this table, together with their age range.
                          Age range
                          Less than 20   20 to 30   Over 30
Radio Fana                22             16         50
FM Addis                  6              11         16
Ethiopia/National radio   35             3          12
a. Is there any evidence that there is an association between age and radio station?
b. By considering the contribution to the value of your test statistic from each cell and the relative sizes of the observed and expected frequencies in each cell, indicate the main source of the association, if any exists.
Feedback
A. Chi-squared= 38.126, Reject H0;
B. More people under 20 listen to National radio than expected.
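Answer B rests on the contribution of each cell to the test statistic. The sketch below computes those contributions with unrounded expected frequencies, so the total may differ slightly from the reported 38.126; the critical value 9.488 is the standard tabular χ² for the 0.05 level with (3-1)(3-1) = 4 degrees of freedom, quoted here as an assumption.

```python
stations = ["Radio Fana", "FM Addis", "Ethiopia/National radio"]
age_ranges = ["Less than 20", "20 to 30", "Over 30"]
observed = [
    [22, 16, 50],
    [6, 11, 16],
    [35, 3, 12],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

expected = {}
contributions = {}
for i, station in enumerate(stations):
    for j, age in enumerate(age_ranges):
        e = row_totals[i] * column_totals[j] / grand_total
        expected[(station, age)] = e
        contributions[(station, age)] = (observed[i][j] - e) ** 2 / e

chi_square = sum(contributions.values())
largest_cell = max(contributions, key=contributions.get)
CRITICAL_05_DF4 = 9.488   # assumption: standard table value, 0.05 level, 4 d.f.

print(round(chi_square, 2), chi_square > CRITICAL_05_DF4)
print(largest_cell, round(expected[largest_cell], 1))
```

The largest contribution comes from the cell for National radio listeners under 20, where 35 were observed against roughly 18 expected.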
The test of association: χ² calculation-Cont’d
In order to assess the effectiveness of a company training program, each employee was appraised before and after the training. Based on the comparisons of the two appraisals, each of the 110 production staff was classified according to how well they had benefited from the training. This classification ranged from ‘worse’, which means they now perform worse than they did before, to ‘high’, which means they perform much better than they did before the training. The results of this appraisal can be seen in the following table, where you will notice that employees have been further classified by age:
Age of employee   Level of improvement
                  Worse   None   Some   High   Total
Below 40          1       5      24     30     60
40+               4       5      31     10     50
Total             5       10     55     40     110
a. Is there any association between level of improvement at the job and age?
b. What is the 95% confidence interval for the percentage of employees who showed a high level of improvement at their job?
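One common way to approach part b is the large-sample (normal approximation) interval for a proportion, p ± z·sqrt(p(1-p)/n) with z = 1.96 for 95% confidence. The lecture does not say which method is intended, so the sketch below is an assumption about the method, not the official feedback.

```python
from math import sqrt

high_improvement = 40   # employees classified as 'high'
n = 110                 # all production staff appraised

p = high_improvement / n
z = 1.96                                # normal value for 95% confidence
standard_error = sqrt(p * (1 - p) / n)  # standard error of a sample proportion

lower = p - z * standard_error
upper = p + z * standard_error
print(f"{100 * p:.1f}% ({100 * lower:.1f}% to {100 * upper:.1f}%)")
```

With these figures the interval runs from about 27.4% to 45.4% around the sample value of 36.4%.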
7.5. Data Interpretation
• Interpretation is the process by which you put your own meaning on the data you have collected and analyzed, and compare that meaning with those advanced by others.
• It refers to the process of giving meaning to the data and spelling out the implications in relation to the research questions and objectives.
• It involves tracing back to the objectives, questions, and problems.
• It requires making good arguments (claims, reasons, and evidence).
• Example 1: Over the last ten years, the local temperature has increased by 10 degrees C (claim). This is because of extensive deforestation (reason). Evidence must then be presented to support the claim.
• Example 2: In Woreda Y, a rural area, students could not sit for the exam (claim) because of the peak agricultural period (reason). The responses show that only 25% of students sat for the exam (evidence).
7.6. The application of SPSS for data processing and analysis
Summary of data analysis
• Levels of quantitative analysis
  Descriptive statistics: variable frequencies, averages, ranges
  o For nominal or ordinal data: proportions, percentages, ratios
  o For interval or ratio data:
  • Measures of central tendency: mean (total sum of values divided by the number of cases), median (the value of the middle case), mode (the most frequently occurring value)
  • Measures of dispersion: range (the difference between the highest and lowest values), standard deviation (the square root of the mean of the squared deviations from the mean)
  Inferential statistics: assessing the significance of your data and results
  Simple inter-relationships: cross-tabulation or correlation between two variables
  Multivariate analysis: studying the linkages between more than two variables