Chapter 10: Data Analysis - Quantitative

Quantitative Analysis
Descriptive Statistics
• Numeric data collected in a research project can be analyzed quantitatively using statistical tools in two different ways.
• Descriptive analysis refers to statistically describing, aggregating, and presenting the constructs of interest or the associations between these constructs.
• Inferential analysis refers to the statistical testing of hypotheses (theory testing).
• Much of today's quantitative data analysis is conducted using statistical software packages such as SPSS, SAS, or WEKA.
• Readers are advised to familiarize themselves with one of these programs to follow the concepts described in this discussion.
Data Preparation
• In research projects, data may be collected from a variety of sources:
• Mail-in surveys,
• Interviews,
• Pretest or posttest experimental data,
• Observational data, and so forth.
• These data must be converted into a machine-readable, numeric format, such as a spreadsheet or a text file, so that they can be analyzed by computer programs like MS Excel, SPSS, SAS, or WEKA.
Data preparation usually involves the following steps.
Data Coding
• Coding is the process of converting data into numeric
format.
• A codebook should be created to guide the coding
process.
• A codebook is a comprehensive document containing:
• A detailed description of each variable in the research study,
• The items or measures for that variable,
• The format of each item (numeric, text, etc.),
• The response scale for each item (i.e., whether it is measured on a nominal, ordinal, interval, or ratio scale, and whether that scale is a five-point, seven-point, or some other type of scale), and
• How to code each value into a numeric format.
For Instance,
• If we have a measurement item on a
seven-point Likert scale with anchors
ranging from “strongly disagree” to
“strongly agree”, we may code that item as
1 for strongly disagree, 4 for neutral, and
7 for strongly agree.
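A minimal sketch of such a coding rule in Python (the intermediate labels between the two anchors are assumptions for illustration, not taken from the study):

```python
# Hypothetical coding scheme for a seven-point Likert item, following the rule
# "1 = strongly disagree ... 4 = neutral ... 7 = strongly agree".
LIKERT_CODES = {
    "strongly disagree": 1,
    "disagree": 2,           # assumed intermediate label
    "somewhat disagree": 3,  # assumed intermediate label
    "neutral": 4,
    "somewhat agree": 5,     # assumed intermediate label
    "agree": 6,              # assumed intermediate label
    "strongly agree": 7,
}

def code_response(label: str) -> int:
    """Convert a verbal Likert response into its numeric code."""
    return LIKERT_CODES[label.strip().lower()]

print(code_response("Strongly Agree"))  # -> 7
```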
Data Entry
• Coded data can be entered into a spreadsheet,
database, text file, or directly into a statistical
program like SPSS.
• Most statistical programs provide a data editor for
entering data.
• However, these programs store data in their own
native format (e.g., SPSS stores data as .sav files),
which makes it difficult to share that data with
other statistical programs.
• Hence, it is often better to enter data into a spreadsheet or database, where they can be reorganized as needed and shared across programs, and where subsets of the data can be extracted for analysis.
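As a rough sketch (the file name and column below are hypothetical), data entered into a spreadsheet and exported as CSV can be pulled into an analysis environment such as Python's pandas:

```python
import pandas as pd

# Hypothetical file exported from the spreadsheet of coded responses.
df = pd.read_csv("survey.csv")

print(df.head())    # inspect the first few records
print(df.dtypes)    # check that each variable was read in the intended format

# Extract a subset of the data for analysis (assumes an 'age' column exists).
adults = df[df["age"] >= 18]
```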
Univariate Analysis
• Univariate Analysis, or analysis of a single
variable, refers to a set of statistical
techniques that can describe the general
properties of one variable.
Univariate statistics include:
1. Frequency distribution,
2. Central tendency, and
3. Dispersion.
i) Frequency Distribution
• The frequency distribution of a variable is a summary of
the frequency (or percentages) of individual values or
ranges of values for that variable.
For Instance,
• We can measure how often respondents in a sample attend religious services (as a measure of their "religiosity") using a categorical scale with the values:
• Never,
• Once per year,
• Several times per year,
• About once a month,
• Several times per month,
• Several times per week, and
• An optional category for "did not answer."
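A small sketch of how such a frequency distribution could be tabulated with pandas (the responses below are made-up illustrations, not the study's data):

```python
import pandas as pd

# Hypothetical coded responses for the "religiosity" item.
responses = pd.Series([
    "Never", "Once per year", "Several times per year", "Never",
    "About once a month", "Several times per week", "Never",
    "Several times per month", "Once per year", "Did not answer",
])

# Frequency distribution: counts and percentages of each category.
counts = responses.value_counts()
percents = (responses.value_counts(normalize=True) * 100).round(1)
print(pd.DataFrame({"count": counts, "percent": percents}))
```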
Figure 10.1: Frequency Distribution
• With very large samples where observations are independent and random, the frequency distribution tends to resemble a bell-shaped curve (a smoothed bar chart of the frequency distribution).
• Most observations are clustered toward the center of the range of values, with fewer and fewer observations toward the extreme ends of the range.
• Such a curve is called a normal distribution.
ii. Central Tendency
• It is an estimate of the center of a distribution of
values.
• There are three major estimates of central
tendency: Mean, Median, and Mode.
• The arithmetic mean (often simply called
the “mean”) is the simple average of all
values in a given distribution.
• Consider a set of eight test scores: 15, 22, 21, 18, 36, 15, 25, and 15.
• The arithmetic mean of these values is (15 + 22 + 21 + 18 + 36 + 15 + 25 + 15)/8 = 167/8 = 20.875.
• Two other measures of central tendency are the Geometric Mean (the nth root of the product of the n values in a distribution) and the Harmonic Mean (the reciprocal of the arithmetic mean of the reciprocals of the values in a distribution).
• However, these types of means are not popular in the statistical analysis of social research data.
• GM = (X1 × X2 × ... × Xn)^(1/n)
• HM = n / (1/X1 + 1/X2 + ... + 1/Xn)
The Median
• The second measure of central tendency, the
median, is the middle value within a range of
values in a distribution.
• This is computed by sorting all values in a
distribution in increasing order and
selecting the middle value.
• In case there are two middle values (i.e., there is an even number of values in the distribution), the average of the two middle values represents the median.
• In the above example, the sorted values are: 15, 15, 15, 18, 21, 22, 25, 36.
• The two middle values are 18 and 21, and hence the median is (18 + 21)/2 = 19.5.
The Mode
• Lastly, the mode is the most frequently
occurring value in a distribution of values.
• In the previous example, the most
frequently occurring value is 15, which is
the mode of the above set of test scores.
• Note that any value estimated from a sample, such as the mean, median, mode, or any of the other estimates discussed later, is called a statistic.
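These three estimates can be reproduced for the test scores above with Python's standard statistics module:

```python
import statistics

scores = [15, 22, 21, 18, 36, 15, 25, 15]

print(statistics.mean(scores))    # arithmetic mean -> 20.875
print(statistics.median(scores))  # middle of the sorted scores -> 19.5
print(statistics.mode(scores))    # most frequent value -> 15
```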
iii) Dispersion
• Dispersion refers to the way values are
spread around the central tendency, for
example, how tightly or how widely the
values are clustered around the mean.
• Two common measures of dispersion
are:
• the range and
• standard deviation.
a) Range
• The range is the difference between the
highest and lowest values in a distribution.
• The range in our previous example is 36 - 15 = 21.
• If the maximum value were raised to 85 in the above distribution while the other values remained the same, the range would be 85 - 15 = 70.
• The range is therefore highly sensitive to such extreme values (outliers).
b) Standard Deviation
• The standard deviation, the second measure of dispersion, corrects for such outliers by using a formula that takes into account how close or how far each value is from the distribution mean:
• σ = √( Σ (xi - μ)² / n )
where
• σ is the standard deviation,
• xi is the ith observation (or value),
• μ is the arithmetic mean,
• n is the total number of observations, and
• Σ means summation across all observations.
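A short sketch computing the range and the standard deviation defined above for the same test scores (pstdev divides by n, matching the population formula with μ and n):

```python
import statistics

scores = [15, 22, 21, 18, 36, 15, 25, 15]

value_range = max(scores) - min(scores)   # range: 36 - 15 = 21
sigma = statistics.pstdev(scores)         # population standard deviation (divides by n)
variance = statistics.pvariance(scores)   # variance = sigma ** 2

print(value_range, round(sigma, 2), round(variance, 2))
```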
c) Variance
• The square of the standard deviation is called the variance of a distribution.
• In a normally distributed frequency distribution, it is seen that:
• 68% of the observations lie within one standard deviation of the mean (μ ± 1σ),
• 95% of the observations lie within two standard deviations (μ ± 2σ), and
• 99.7% of the observations lie within three standard deviations (μ ± 3σ).
Bivariate Analysis
• Bivariate Analysis examines how two
variables are related to each other.
• The most common bivariate statistic is the
bivariate correlation (often, simply called
“correlation”), which is a number between
-1 and +1 denoting the strength of the
relationship between two variables.
• Let's say that we wish to study how age is related to self-esteem in a sample of 20 respondents, i.e., as age increases, does self-esteem increase, decrease, or remain unchanged?
• If self-esteem increases, then we have a
positive correlation between the two variables,
• if self-esteem decreases, we have a negative
correlation, and
• if it remains the same, we have a zero
correlation.
• To calculate the value of this correlation, consider
the hypothetical dataset shown below.
Table: Hypothetical Data on Age and Self-Esteem
• The bivariate (Pearson) correlation is computed as:
• rxy = Σ (xi - x̄)(yi - ȳ) / ((n - 1) · sx · sy)
• where rxy is the correlation, x̄ and ȳ are the sample means of x and y, and sx and sy are the standard deviations of x and y.
• The manually computed value of correlation
between age and self-esteem, using the
above formula is 0.79.
• This figure indicates that age has a
strong positive correlation with self-
esteem, i.e., self-esteem tends to
increase with increasing age, and
decrease with decreasing age.
• A bivariate scatter plot of this data plots self-esteem on the vertical axis against age on the horizontal axis.
• This plot roughly resembles an upward sloping line (i.e., positive slope), which is also indicative of a positive correlation.
• If the two variables were negatively correlated, the scatter plot would slope downward (negative slope), implying that an increase in age would be related to a decrease in self-esteem and vice versa.
• If the two variables were uncorrelated, the scatter plot would approximate a horizontal line (zero slope), implying that an increase in age would have no systematic bearing on self-esteem.
• To test whether this correlation is statistically significant, we set up the hypotheses:
• H0: r = 0
• H1: r ≠ 0
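Since the original table of 20 respondents is not reproduced here, the sketch below uses made-up age and self-esteem values; the computed r will therefore differ from the 0.79 reported above, but the procedure (correlation plus a test of H0: r = 0) is the same:

```python
from scipy import stats

# Hypothetical age and self-esteem scores (illustrative values only).
age         = [21, 25, 28, 32, 35, 38, 42, 45, 50, 55]
self_esteem = [2.8, 3.0, 3.2, 3.1, 3.5, 3.6, 3.8, 3.7, 4.0, 4.2]

r, p_value = stats.pearsonr(age, self_esteem)
print(f"r = {r:.2f}, p = {p_value:.4f}")
# A p-value below 0.05 would lead us to reject H0: r = 0.
```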
Cross-Tab/Contingency Table
• A cross-tabulation (cross-tab) presents the joint frequency counts of two categorical variables. In the example discussed below, the grades (A, B, C) of 20 students are cross-tabulated by gender (male, female) in a 2 x 3 table.
Is this pattern real or "statistically significant"?
• In other words, do the above frequency counts differ from what may be expected from pure chance?
• To answer this question, we compute the expected count of observations in each cell of the 2 x 3 cross-tab matrix.
• This is done by multiplying the marginal column total and the marginal row total for each cell and dividing it by the total number of observations.
For Example,
• For the male/A grade cell, expected count = (row total × column total) / grand total = (10 × 5)/20 = 2.5.
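A sketch of the same calculation in Python; the observed counts are hypothetical (they only respect the marginal totals used above: 10 males, 5 A grades, 20 students in all):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 3 observed counts (rows: male, female; columns: A, B, C).
observed = np.array([
    [3, 4, 3],
    [2, 5, 3],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(expected)            # expected counts; male/A cell = (10 * 5) / 20 = 2.5
print(chi2, p_value, dof)  # chi-square statistic, p-value, degrees of freedom
```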
Inferential Statistics
• Inferential statistics are the statistical
procedures that are used to reach conclusions
about associations between variables.
• They differ from descriptive statistics in that
they are explicitly designed to test hypotheses.
• Numerous statistical procedures fall in this
category, most of which are supported by
modern statistical software such as SPSS and
SAS.
• Readers are advised to consult a formal text on
statistics or take a course on statistics for
more advanced procedures.
Testing for Significance
• Having formulated the hypothesis, the next step is to test its validity at a certain level of significance.
• The confidence with which a null hypothesis is accepted or rejected depends upon the significance level.
• A significance level of, say, 5% means that the risk of making a wrong decision is 5%: the researcher is likely to be wrong in accepting a false hypothesis or rejecting a true hypothesis in 5 out of 100 occasions.
• A significance level of, say, 1% means that the researcher runs the risk of being wrong in accepting or rejecting the hypothesis in only one out of every 100 occasions.
• Therefore, a 1% significance level provides greater confidence in the decision than a 5% significance level.
One-Tailed and Two-Tailed Tests
• A hypothesis test may be one-tailed or two-tailed.
a) One-Tailed Test
• In a one-tailed test, the test statistic leads to rejection of the null hypothesis only when it falls in one specified tail of the sampling distribution curve.
Example 2
• A tyre company claims that the mean life of its new tyre is 15,000 km. If the concern is only whether the tyres fall short of this claim, the alternative hypothesis is one-sided (H1: μ < 15,000 km), and the entire rejection region lies in one tail.
b) Two-Tailed Test
• A two-tailed test is one in which the test statistic leads to rejection of the null hypothesis when it falls in either tail of the sampling distribution curve.
a) Degrees of Freedom
• Degrees of freedom are the number of values in a calculation that are free to vary. For example, if (a + b)/2 = 5 and we fix a = 3, then b has to be 7; only one value was free to vary, so the degrees of freedom is 1.
b) Select the Test Criterion
• If the hypothesis pertains to a large sample (30 or more), the Z-test is used.
• When the sample is small (less than 30), the t-test is used.
c) Carry Out the Computation
• Compute the value of the chosen test statistic from the sample data.
d) Make the Decision
• Compare the computed value of the test statistic with its critical (table) value at the chosen level of significance, and accept or reject the null hypothesis accordingly.
Assumptions of Parametric and Non-Parametric Tests
1) Observations in the population are normally distributed.
2) Observations in the population are independent of each other.
3) The population should possess homogeneous characteristics.
4) Samples should be drawn using simple random sampling techniques.
5) To use the t-test, the sample size should be less than 30.
6) To use the F-test, the sample size should be less than 30.
7) To use the Z-test, the sample size should be more than 30.
8) To use the chi-square test, the minimum expected frequency in each cell should be 5.
a) Parametric Tests
• Parametric tests are more powerful.
• The data used in these tests are derived from interval and ratio measurements.
• In parametric tests, it is assumed that the data follow a normal distribution. Examples of parametric tests are (a) the Z-test, (b) the t-test, and (c) the F-test.
• Observations must be independent, i.e., the selection of any one item should not affect the chances of any other item being included in the sample.
b) Non-Parametric Tests
• Non-parametric tests are used to test hypotheses with nominal and ordinal data.
• They do not make assumptions about the shape of the population distribution.
• These are distribution-free tests.
• The hypothesis of a non-parametric test is concerned with something other than the value of a population parameter.
• Non-parametric tests are easy to compute. There are certain situations, particularly in marketing research, where the assumptions of parametric tests (e.g., that the data follow a normal distribution) are not valid; in such cases, non-parametric tests are used.
• Examples of non-parametric tests are:
a) Binomial test
b) Chi-square test
c) Mann-Whitney U test
d) Sign test
Binomial Test
• A binomial test is used when the population has only two classes, such as male/female, buyers/non-buyers, success/failure, etc.
• All observations made about the population must fall into one of the two classes.
• The binomial test is used when the sample size is small.
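A minimal sketch with hypothetical counts, testing whether the proportion of one class (say, buyers) differs from an assumed 50%:

```python
from scipy.stats import binomtest

# Hypothetical small sample: 9 buyers out of 15 respondents.
# H0: the true proportion of buyers is 0.5.
result = binomtest(k=9, n=15, p=0.5, alternative="two-sided")
print(result.pvalue)  # a large p-value gives no evidence against H0
```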
Advantages of Non-Parametric Tests
i) They are quick and easy to use.
Disadvantages of Non-Parametric Tests
• Non-parametric tests involve a greater risk of accepting a false hypothesis and thus committing a Type II error.
Examples of Parametric Tests
• T-Test (Parametric Test)
• The t-test is used in the following circumstances:
• When the sample size is small (n < 30), and
• When the population standard deviation is not known.
Example:
• A certain pesticide is packed into bags by a machine. A random sample of 10 bags is drawn, and their contents (in kg) are found to be: 50, 49, 52, 44, 45, 48, 46, 45, 49, 45. Confirm whether the average packaging weight can be taken to be 50 kg.
• Here the sample size is less than 30 and the population standard deviation is not known, so a one-sample t-test is appropriate.
• The t-test can also be used to find out whether there is a significant difference between two sample means, i.e., whether two population means are equal.
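A sketch of this one-sample t-test using the bag weights given in the example:

```python
from scipy import stats

# Contents (kg) of the 10 sampled bags from the example above.
weights = [50, 49, 52, 44, 45, 48, 46, 45, 49, 45]

# One-sample t-test of H0: population mean = 50 kg.
t_stat, p_value = stats.ttest_1samp(weights, popmean=50)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, we reject the claim that the average packaging weight is 50 kg.
```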
Illustration
• There are two nourishment programmes, 'A' and 'B'. Two groups of children are subjected to these programmes, and their weights are measured after six months.
• The first group of children, subjected to programme 'A', weighed 44, 37, 48, 60, and 41 kg at the end of the programme.
• The second group of children, subjected to programme 'B', weighed 42, 42, 58, 64, 64, 67, and 62 kg at the end of the programme.
• From the above, can we conclude that nourishment programme 'B' increased the weight of the children significantly, at a 5% level of significance?
Solution
• H0: the two programmes lead to the same mean weight; H1: programme 'B' leads to a higher mean weight.
• Here, n1 = 5 and n2 = 7.
• Mean of group A: X̄ = (44 + 37 + 48 + 60 + 41)/5 = 230/5 = 46.
• Mean of group B: Ȳ = ΣY/n2 = 399/7 = 57.

Nourishment Programme A              Nourishment Programme B
X     X - X̄    (X - X̄)²              Y     Y - Ȳ    (Y - Ȳ)²
44     -2         4                  42    -15        225
37     -9        81                  42    -15        225
48      2         4                  58      1          1
60     14       196                  64      7         49
41     -5        25                  64      7         49
                                     67     10        100
                                     62      5         25
Σ(X - X̄)² = 310                      Σ(Y - Ȳ)² = 674

• Pooled variance: S² = [Σ(X - X̄)² + Σ(Y - Ȳ)²] / (n1 + n2 - 2) = (310 + 674)/10 = 98.4.
• t = (X̄ - Ȳ) / √(S² (1/n1 + 1/n2)) = (46 - 57) / √(98.4 × (1/5 + 1/7)) = -11/5.81 = -1.89.
• With n1 + n2 - 2 = 10 degrees of freedom, the one-tailed critical value of t at the 5% level is about 1.81. Since |-1.89| > 1.81, we reject H0 and conclude that nourishment programme 'B' increased the weight of the children significantly.
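The hand computation above can be checked with a pooled-variance (equal variances assumed) two-sample t-test in scipy:

```python
from scipy import stats

programme_a = [44, 37, 48, 60, 41]
programme_b = [42, 42, 58, 64, 64, 67, 62]

# Independent two-sample t-test with pooled variance (equal_var=True),
# matching the hand computation above; t should come out near -1.89.
t_stat, p_value = stats.ttest_ind(programme_a, programme_b, equal_var=True)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.4f}")
# For the one-sided question ("did programme B increase weight?"),
# halve the two-tailed p-value or, in newer scipy, pass alternative="less".
```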
Analysis of Variance (ANOVA)
a) ANOVA
• ANOVA is a statistical technique used to test the equality of three or more sample means.
• Based on the sample means, an inference is drawn as to whether the samples belong to the same population or not.
b) Conditions for using ANOVA
1. Data should be quantitative in nature.
(d) Compare the value of F obtained above in (c) with the critical value of F at, say, the 5% level of significance for the applicable degrees of freedom.
Example: ANOVA is useful, for instance, for testing whether several training methods, machines, or advertising campaigns lead to the same mean outcome.
Two-Way ANOVA
• The procedure followed to calculate the variance is the same as for the one-way classification. An example of a two-way classification ANOVA is as follows:
Example:
• A firm has four types of machines: A, B, C, and D. It has put four of its workers on each machine for a specified period, say one week. At the end of the week, the average output of each worker on each type of machine was calculated. These data are given below:
Average production by type of machine
Worker      A    B    C    D
Worker 1   25   26   23   28
Worker 2   23   22   24   27
Worker 3   27   30   26   32
Worker 4   29   34   27   33
The firm is interested in knowing:
(a) whether the mean productivity is the same for the four types of machines, and
(b) whether the four workers differ with respect to mean productivity.
A sketch of the computation for this two-way layout follows.
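A sketch of the two-way computation under the usual assumptions for this layout (one observation per worker-machine cell, no interaction term); the F-ratios for machines and for workers are built from sums of squares in the same spirit as the one-way case:

```python
import numpy as np
from scipy import stats

# Rows = workers 1-4, columns = machines A-D (data from the table above).
data = np.array([
    [25, 26, 23, 28],
    [23, 22, 24, 27],
    [27, 30, 26, 32],
    [29, 34, 27, 33],
], dtype=float)

r, c = data.shape
grand_mean = data.mean()

# Sums of squares for a two-way layout without replication.
ss_rows  = c * ((data.mean(axis=1) - grand_mean) ** 2).sum()   # between workers
ss_cols  = r * ((data.mean(axis=0) - grand_mean) ** 2).sum()   # between machines
ss_total = ((data - grand_mean) ** 2).sum()
ss_error = ss_total - ss_rows - ss_cols

ms_rows  = ss_rows / (r - 1)
ms_cols  = ss_cols / (c - 1)
ms_error = ss_error / ((r - 1) * (c - 1))
df_error = (r - 1) * (c - 1)

f_machines = ms_cols / ms_error
f_workers  = ms_rows / ms_error
print(f"F(machines) = {f_machines:.2f}, p = {stats.f.sf(f_machines, c - 1, df_error):.4f}")
print(f"F(workers)  = {f_workers:.2f}, p = {stats.f.sf(f_workers, r - 1, df_error):.4f}")
```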
Illustration
• Company 'X' wants its employees to undergo three different types of training programme with a view to obtaining improved productivity. After completion of the training, 16 new employees are assigned at random to the three training methods, and their production performance is recorded. The training manager's problem is to find out whether there are any differences in the effectiveness of the training methods. The data recorded are as under:
Daily output of new employees
Method 1: 15, 18, 19, 22, 11
Method 2: 22, 27, 18, 21, 17
Method 3: 18, 24, 19, 16, 22, 15
The computation proceeds in the following steps:
1. Calculate the mean of each sample.
2. Calculate the grand mean of all observations.
3. Calculate the between-column (between-samples) variance: Σ ni(x̄i - grand mean)² / (k - 1), where k is the number of samples.
4. Calculate each sample variance using the formula: si² = Σ(X - x̄i)² / (ni - 1).
5. Calculate the within-column variance as a weighted average of the sample variances: Σ [(ni - 1)/(N - k)] si², where N = n1 + n2 + n3, so that N - k = (n1 + n2 + n3 - 3) is its degrees of freedom.
6. Compute the F-ratio = between-column variance / within-column variance and compare it with the critical value of F.
Solution
1. Sample values, totals, and means:

         Method 1   Method 2   Method 3
            15         22         18
            18         27         24
            19         18         19
            22         21         16
            11         17         22
                                  15
Total       85        105        114
Mean        17         21         19
2. Grand mean = (85 + 105 + 114)/16 = 304/16 = 19.

3. Between-column variance:

n    Sample mean   Grand mean   Deviation   Deviation²   n × Deviation²
5        17            19          -2           4          5 × 4 = 20
5        21            19           2           4          5 × 4 = 20
6        19            19           0           0          6 × 0 = 0

Between-column variance = Σ n(Deviation)² / (k - 1) = (20 + 20 + 0)/(3 - 1) = 40/2 = 20.
4. Sample variances (sum of squared deviations of each observation from its own sample mean):
Training Method 1: Σ(X - x̄1)² = 70, so s1² = 70/(5 - 1) = 17.5
Training Method 2: Σ(X - x̄2)² = 62, so s2² = 62/(5 - 1) = 15.5
Training Method 3: Σ(X - x̄3)² = 60, so s3² = 60/(6 - 1) = 12
5. Within-column variance (weighted average of the sample variances):
   = (4/13) × 17.5 + (4/13) × 15.5 + (5/13) × 12 = (70 + 62 + 60)/13 = 192/13 ≈ 14.77.

6. F = between-column variance / within-column variance = 20/14.77 = 1.354.

7. Degrees of freedom of the numerator = (3 - 1) = 2; degrees of freedom of the denominator = (16 - 3) = 13.

8. The critical value of F(2, 13) at the 5% level of significance is 3.81. This is the upper limit of the acceptance region. Since the calculated value 1.354 lies within it, we accept H0, the null hypothesis.
Conclusion:
• Since the calculated F-value (1.354) is less than the table value (3.81), there is no significant difference in the effectiveness of the three training methods. A quick verification of this computation is sketched below.
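The same one-way ANOVA can be verified with scipy:

```python
from scipy import stats

method_1 = [15, 18, 19, 22, 11]
method_2 = [22, 27, 18, 21, 17]
method_3 = [18, 24, 19, 16, 22, 15]

# One-way ANOVA; F should come out near the 1.354 computed by hand above.
f_stat, p_value = stats.f_oneway(method_1, method_2, method_3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# A p-value above 0.05 means we do not reject H0: no significant difference
# in the effectiveness of the three training methods.
```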
Statistical Approaches
Regression Analysis
• Regression analysis is a statistical procedure for analyzing associative relationships between a metric dependent variable and one or more independent variables.
• It can be used in the following ways:
• Determine whether the independent variables explain a significant variation in the dependent variable: whether a relationship exists.
• Determine how much of the variation in the dependent variable can be explained by the independent variables: the strength of the relationship.
• Determine the structure or form of the relationship: the mathematical equation relating the independent and dependent variables.
• Predict the values of the dependent variable.
• Control for other independent variables when evaluating the contribution of a specific variable or set of variables.
Bivariate Regression
• Bivariate regression is a procedure for deriving a mathematical relationship, in the form of an equation, between a single metric dependent variable and a single metric independent variable.
• This analysis is similar in many ways to determining the simple correlation between two variables.
• However, because an equation has to be derived, one variable must be identified as the dependent variable and the other as the independent variable.
• For Example,
• Can variation in sales be explained in terms of variation in advertising expenditures? What is the structure and form of this relationship, and can it be modeled mathematically by an equation describing a straight line?
• Can the variation in market share be accounted for by the size of the sales force?
• Are consumers' perceptions of quality determined by their perceptions of price?
Bivariate Regression Model
• The basic regression equation is:
• Yi = β0 + β1Xi + ei
• where Yi is the dependent (criterion) variable, Xi is the independent (predictor) variable, β0 is the intercept of the line, β1 is the slope of the line, and ei is the error term associated with the ith observation.
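A minimal sketch of fitting such a line; the advertising and sales figures below are hypothetical, used only to illustrate the mechanics:

```python
from scipy import stats

# Hypothetical data: advertising expenditure (X) and sales (Y).
advertising = [10, 12, 15, 17, 20, 22, 25, 27]
sales       = [44, 47, 55, 60, 68, 70, 79, 82]

fit = stats.linregress(advertising, sales)
print(f"intercept (b0) = {fit.intercept:.2f}")
print(f"slope (b1)     = {fit.slope:.2f}")
print(f"R-squared      = {fit.rvalue ** 2:.3f}")

# Predicted sales at a new advertising level of 30 (illustrative).
print(fit.intercept + fit.slope * 30)
```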
Multiple Regression
• Multiple regression is a technique that simultaneously develops a mathematical relationship between two or more independent variables and an interval-scaled dependent variable.
• For Example,
• Can variation in sales be explained in terms of variation in advertising expenditure, prices, and level of distribution?
• Can variation in market share be accounted for by the size of the sales force, advertising expenditure, and sales promotion budget?
• Are consumers' perceptions of quality determined by their perceptions of prices, brand image, and brand attributes?
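A sketch of estimating such a model by ordinary least squares with numpy; the predictors (advertising and price) and all values are again hypothetical:

```python
import numpy as np

# Hypothetical data: sales explained by advertising expenditure and price.
advertising = np.array([10, 12, 15, 17, 20, 22, 25, 27], dtype=float)
price       = np.array([9.0, 8.5, 8.0, 8.2, 7.5, 7.0, 6.8, 6.5])
sales       = np.array([44, 47, 55, 60, 68, 70, 79, 82], dtype=float)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(advertising), advertising, price])
coeffs, *_ = np.linalg.lstsq(X, sales, rcond=None)

b0, b1, b2 = coeffs
print(f"sales = {b0:.2f} + {b1:.2f} * advertising + {b2:.2f} * price")
```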