Data Analysis RM
A missing observation should be assigned a code that cannot be confused with any value the
variable actually takes in the survey. Had the value of the missing observation been available, it
could perhaps have led to different research conclusions. How far the observed results deviate
from the actual ones depends upon the number of missing observations and the extent to which
the missing data would have differed from the actual observations.
Generally, if the volume of missing data is small, it is unlikely to affect the conclusion from
the analysis. This may not always be the case. It is for this reason that the ‘valid per cent’
column, which is computed on the non-missing observations only, should be used for
interpreting the results.
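As a rough illustration of how the ‘valid per cent’ column differs from the plain ‘per cent’ column, the following Python sketch tabulates a small made-up variable. The missing-value code 9, the answer codes and the response list are all assumptions chosen for illustration; they are not data from the survey.

```python
from collections import Counter

MISSING = 9  # hypothetical code; it must not coincide with a real answer (1-3 here)

def frequency_table(responses, missing=MISSING):
    """Return {value: (count, per cent, valid per cent)} for a coded variable."""
    counts = Counter(responses)
    total = len(responses)
    valid_total = total - counts.get(missing, 0)
    table = {}
    for value in sorted(counts):
        count = counts[value]
        percent = 100.0 * count / total
        # Valid per cent is based on non-missing observations only,
        # so it is left undefined for the missing code itself.
        valid = None if value == missing else 100.0 * count / valid_total
        table[value] = (count, percent, valid)
    return table

# Made-up responses: 1 = yes, 2 = no, 3 = undecided, 9 = missing
responses = [1, 1, 2, 3, 1, 9, 2, 1, 9, 2]
for value, (count, percent, valid) in frequency_table(responses).items():
    print(value, count, percent, valid)
```

With two of the ten made-up observations missing, answer 1 accounts for 40 per cent of all observations but 50 per cent of the valid ones.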
Analysis of Multiple Responses
At times, the researcher comes across multiple category questions where respondents could choose
more than one answer. In such a case, the preparation of the frequency table and its interpretation
are slightly different.
If the question in the research study is multiple category question and the respondents are allowed
to tick more than one choice, the percentage in such a case may not add up to 100.
Ex- When accessing the internet at a cyber café, tick up to frequently used applications for which
you use the cyber café.
1) E-mail
2) Chat
3) Browsing
4) Downloading
5) Shopping
6) Entertainment
7) Education
In Table 11.7 the percentages are computed on the total sample size of 414. If these
percentages are added up, they would exceed 100 per cent. This is because respondents
were given the chance to choose more than one answer. The interpretation of the table
is based on a sample of 414 and is given as:
• The most used application at a cyber café is e-mail. It is seen that 94.9 per cent of the users
make use of this.
• The second popular application is chatting, and 76.3 per cent of the sample respondents
make use of it.
• Similarly, other applications in order of preference are browsing (56 per cent),
downloading (47.6 per cent), education (35.4 per cent), entertainment (32.6 per cent) and so
on.
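The way such a multiple-response table is built can be sketched in Python as follows. The five respondents and their answers are made up for illustration (Table 11.7 itself is not reproduced here), and the function name is hypothetical. Each percentage is computed on the number of respondents, not on the number of ticks, which is why the column total can exceed 100.

```python
from collections import Counter

def multiple_response_table(answers):
    """answers: one set of chosen options per respondent.
    Returns {option: (count, per cent of respondents)}, most frequent first."""
    n = len(answers)
    counts = Counter(option for chosen in answers for option in chosen)
    return {option: (c, 100.0 * c / n) for option, c in counts.most_common()}

# Made-up answers from five hypothetical respondents
answers = [
    {"E-mail", "Chat"},
    {"E-mail", "Browsing", "Downloading"},
    {"E-mail"},
    {"Chat", "E-mail"},
    {"Browsing"},
]
table = multiple_response_table(answers)
for option, (count, percent) in table.items():
    print(f"{option}: {count} ({percent:.1f} per cent)")
print("Sum of percentages:", sum(percent for _, percent in table.values()))
```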
Analysis of Ordinal Scaled Questions
Rank the following five attributes while choosing a restaurant for dinner. Assign a rank of
1 to the most important, 2 to the next important ... and 5 to the least important.
• – Ambience
• – Food quality
• – Menu variety
• – Service
• – Location
From a sample of 32, the responses obtained are given in Table 11.8. To construct
univariate tables out of the given data, one can take up one column at a time from Table
11.8 and prepare a separate frequency table for it. For example, the distribution of ranks
assigned to the attribute food quality is shown in Table 11.9.
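Taking one column at a time can be sketched as below. The four rankings are made up for illustration (the 32 responses of Table 11.8 are not reproduced here), and the function name is an assumption.

```python
from collections import Counter

ATTRIBUTES = ["Ambience", "Food quality", "Menu variety", "Service", "Location"]

def rank_distribution(rankings, attribute):
    """rankings: one dict per respondent mapping attribute -> rank (1 to 5).
    Returns {rank: (count, per cent)} for the chosen attribute."""
    counts = Counter(r[attribute] for r in rankings)
    n = len(rankings)
    return {rank: (counts.get(rank, 0), 100.0 * counts.get(rank, 0) / n)
            for rank in range(1, len(ATTRIBUTES) + 1)}

# Made-up rankings from four hypothetical respondents
rankings = [
    dict(zip(ATTRIBUTES, [3, 1, 4, 2, 5])),
    dict(zip(ATTRIBUTES, [2, 1, 5, 3, 4])),
    dict(zip(ATTRIBUTES, [4, 2, 3, 1, 5])),
    dict(zip(ATTRIBUTES, [3, 1, 2, 4, 5])),
]
food = rank_distribution(rankings, "Food quality")
for rank, (count, percent) in food.items():
    print(rank, count, percent)
```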
Grouping Large Data Sets
Sometimes the data collected is very large and needs to be collapsed for interpretation.
At other times the data has too many categories to allow quick interpretation of the
results.
Interpretation could be facilitated by recoding the data into fewer, broader categories.
Similar analysis could be carried out in the case of interval scale data.
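Recoding into fewer, broader categories might look like the sketch below. The variable (age), the band limits and the twelve values are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical bands: (label, lower limit inclusive, upper limit inclusive)
BANDS = [("Up to 25", 0, 25), ("26-40", 26, 40), ("41-60", 41, 60), ("Above 60", 61, 200)]

def recode(value, bands=BANDS):
    """Return the label of the band that contains value."""
    for label, low, high in bands:
        if low <= value <= high:
            return label
    raise ValueError(f"{value} falls outside every band")

# Made-up ages of twelve hypothetical respondents
ages = [19, 23, 31, 35, 38, 44, 52, 58, 63, 71, 24, 29]
grouped = Counter(recode(age) for age in ages)
for label, _, _ in BANDS:
    print(label, grouped[label])
```

Twelve distinct values are reduced to four categories, which are far quicker to read in a frequency table.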
Measures of Central Tendency
Mean
Mean represents the arithmetic average of a variable.
It is appropriate for interval and ratio scale data.
Median
The median can be computed for ratio, interval or ordinal scale data.
The median is that value in the distribution such that 50 per cent of the observations are
below it and 50 per cent are above it.
The median for the ungrouped data is defined as the middle value when the data is
arranged in ascending or descending order of magnitude.
In case the number of items in the sample, n, is odd, the value of the ((n + 1)/2)th item
gives the median.
However, if there is an even number of items in the sample, say of size 2n, the arithmetic
mean of the nth and (n + 1)th items gives the median.
The median could also be computed for the grouped data.
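For ungrouped data, the mean and the odd/even rule for the median can be sketched as follows. The function names and the sample values are illustrative assumptions.

```python
def mean(values):
    """Arithmetic average of the values."""
    return sum(values) / len(values)

def median(values):
    """Middle value of ungrouped data after arranging it in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd sample size: the ((n + 1)/2)th item (counting from 1)
        return ordered[mid]
    # Even sample size: arithmetic mean of the two middle items
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 3, 9, 1, 5]))  # odd number of items
print(median([7, 3, 9, 1]))     # even number of items
print(mean([7, 3, 9, 1, 5]))
```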
Mode
Mode is that measure of central tendency which is appropriate for nominal or higher order scales.
It is the point of maximum frequency in a distribution around which other items of the set cluster
densely.
Mode should not be computed for ordinal or interval data unless these data have been grouped
first.
The concept is widely used in business, e.g. a shoe store owner would be naturally interested in
knowing the size of the shoe that the majority of the customers ask for. Similarly, a garment
manufacturer is interested in determining the size of the shirt that fits most people so as to plan its
production accordingly.
Formula: Mode = l + [(f – f1)/(2f – f1 – f2)] × h
l = lower limit of modal class
f = frequency of modal class
f1 = frequency of the class preceding the modal class
f2 = frequency of the class following the modal class
h = size of the class interval
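The grouped-data formula can be sketched in Python as below. The class limits and frequencies are made up for illustration, and the sketch assumes equal class widths and a single modal class. A missing neighbouring class is treated as having frequency 0.

```python
def grouped_mode(classes):
    """classes: list of (lower limit, upper limit, frequency) in ascending order,
    assumed to have equal widths. Applies l + [(f - f1)/(2f - f1 - f2)] * h."""
    i = max(range(len(classes)), key=lambda k: classes[k][2])  # modal class index
    l, upper, f = classes[i]
    f1 = classes[i - 1][2] if i > 0 else 0                 # preceding class
    f2 = classes[i + 1][2] if i < len(classes) - 1 else 0  # following class
    h = upper - l
    return l + (f - f1) / (2 * f - f1 - f2) * h

# Made-up frequency distribution
classes = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 7), (40, 50, 3)]
print(grouped_mode(classes))
```

Here the modal class is 20 to 30, so l = 20, f = 12, f1 = 8, f2 = 7 and h = 10, giving 20 + (4/9) × 10.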
Skewness
It measures lack of symmetry in the distribution.
In case of symmetrical distribution, mean = median = mode.
For a positively skewed distribution, mean > median > mode.
In such a case, the longer tail of the distribution is towards the right; the mode falls under
the peak, while the mean is pulled towards the tail because it is affected by extreme values.
For a negatively skewed distribution the longer tail is towards the left and arithmetic
mean < median < mode.
The skewness is measured by the difference between arithmetic mean and mode. If the
value of arithmetic mean is greater than mode, skewness is positive and if it is less than
mode, skewness is negative.
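The sign rule based on the difference between mean and mode can be sketched as follows. The data are made up, the function name is an assumption, and the sketch relies on a single clear mode (from Python 3.8 onwards, `statistics.mode` returns the first mode encountered when several values tie).

```python
import statistics

def skewness_direction(values):
    """Classify skewness by the sign of (arithmetic mean - mode)."""
    difference = statistics.mean(values) - statistics.mode(values)
    if difference > 0:
        return "positive"
    if difference < 0:
        return "negative"
    return "none"

# Made-up data with a long right tail
values = [2, 3, 3, 3, 4, 5, 9, 12]
print(statistics.mean(values), statistics.median(values), statistics.mode(values))
print(skewness_direction(values))
```

For these values the mean (5.125) exceeds the median (3.5), which exceeds the mode (3), as expected for a positively skewed distribution.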
Measures of Dispersion
The measures of central tendency locate the centre of the distribution. However, they do not provide enough
information to the researcher to fully understand the distribution being examined.
For example, measures of central tendency do not indicate how items are spread out on either side of the centre.
Therefore, there is a need to study the spread of a distribution of a variable and the methods which provide that are
called measures of dispersion.
The study of dispersion could help in taking better decisions. This is because small dispersion indicates high
uniformity of the items, whereas large variability denotes less uniformity. If returns on a particular investment
show a lot of variability (dispersion), it is a riskier investment than one where variability is very
small. A company may be interested in finding out not only the average sales of a product but also the variability in
the sales over time.
Measures of Dispersion:
1) Range
This is the simplest measure of dispersion and is defined as the distance between the highest (maximum) value and
the lowest (minimum) value in an ordered set of values.
The range could be computed for interval scale and ratio scale data.
Range = Xmax – Xmin
Xmax = Maximum value of the variable
Xmin = Minimum value of the variable
The limitation of range as a measure of dispersion is that it considers only the extreme value and ignores
all other data points.
The value of range could vary considerably from sample to sample.
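A small sketch of the range, and of how a single extreme value dominates it, is given below. Both samples are made up for illustration.

```python
def value_range(values):
    """Range = Xmax - Xmin."""
    return max(values) - min(values)

# Two made-up samples that differ in one observation only
sample_a = [48, 50, 51, 52, 49]
sample_b = [48, 50, 51, 52, 95]

print(value_range(sample_a))
print(value_range(sample_b))
```

Four of the five observations are identical in the two samples, yet the range changes from 4 to 47 because it looks only at the extremes.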
2) Variance and standard deviation
Variance is defined as the mean squared deviation of a variable from its arithmetic mean.
The positive square root of the variance is called standard deviation.
The variance is a difficult measure to interpret because it is expressed in squared units and, therefore,
standard deviation is used as a measure of dispersion.
The population standard deviation is denoted by σ and computed using the following formula:
σ = √[Σ(X – μ)² / N]
X = Value of observations
μ = Population mean
N = Number of observations in the population
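The definition of variance as the mean squared deviation from the arithmetic mean, with the standard deviation as its positive square root, can be sketched as follows. The data are made up, and the functions treat the values as a whole population (dividing by N).

```python
import math

def population_variance(values):
    """Mean squared deviation of the values from their arithmetic mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def population_std(values):
    """Positive square root of the population variance."""
    return math.sqrt(population_variance(values))

# Made-up observations
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(values))
print(population_std(values))
```

For these eight values the mean is 5, the squared deviations sum to 32, the variance is 32/8 = 4 and the standard deviation is 2.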
Descriptive Analysis of Bivariate Data
Bivariate analysis examines the relationship between two variables.
There are three types of measures used for carrying out bivariate analysis.
1) Cross-tabulation
2) Spearman’s rank correlation coefficient
3) Pearson’s linear correlation coefficient
In simple tabulation, the frequency and the percentage for each question were calculated.
In cross-tabulation, responses to two questions are combined and the data are tabulated together.
For example, in cross-tabulating a two-category measure of income (low- and high-income households)
with a two-category measure of purchase intention of a product (low and high purchase intentions) the
basic result is a cross-classification as shown in Table.
The results of cross-tabulation show the number of sample respondents with low income having
low purchase intention, low income with high purchase intention, high income with low
purchase intention and high income with high purchase intention.
As is the case with simple tabulations, the results of a cross-tabulation are more meaningful if
cell frequencies are computed as percentages.
The percentages can be computed (1) row-wise, so that the percentages in each row add up to 100
per cent; (2) column-wise, so that the percentages in each column add up to 100 per cent; or (3)
as cell percentages, such that the percentages added across all cells equal 100 per cent. The
interpretation of percentages is different in each of the three cases.
The basis for calculating category percentage depends upon the nature of relationship between
the variables. One of the variables could be viewed as dependent variable and the other one as
independent variable.
The purchase intention could be treated as the dependent variable, which depends upon income
(the independent variable). The rule is to cast percentages in the direction of the independent (causal)
variable across the dependent variable, that is, so that the percentages add up to 100 per cent
within each category of the independent variable.
The results indicate that with an increase in income, the purchase intention for the product increases.
A high association between two variables does not, however, imply a cause-and-effect
relationship.
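Casting percentages in the direction of the independent variable can be sketched as below. The cell counts are made up for illustration and are not the figures from the table in the text. The function name is an assumption.

```python
from collections import Counter

def crosstab_percent(pairs):
    """pairs: one (independent category, dependent category) tuple per respondent.
    Returns cell counts and percentages that add up to 100 within each
    category of the independent variable."""
    cells = Counter(pairs)
    base = Counter(independent for independent, _ in pairs)
    percent = {(ind, dep): 100.0 * count / base[ind]
               for (ind, dep), count in cells.items()}
    return cells, percent

# Made-up sample of 100 hypothetical households
pairs = (
    [("Low income", "Low intention")] * 30
    + [("Low income", "High intention")] * 10
    + [("High income", "Low intention")] * 15
    + [("High income", "High intention")] * 45
)
cells, percent = crosstab_percent(pairs)
for key in sorted(percent):
    print(key, cells[key], percent[key])
```

In this made-up sample, 25 per cent of low-income households show high purchase intention against 75 per cent of high-income households. The comparison is made across income categories because income is the variable on which the percentages are based.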
Correlation
Correlation measures the degree of association between two or more variables. When we
are dealing with two variables, we are talking in terms of simple correlation and when
more than two variables are involved, the subject matter of interest is called multiple
correlation.
Positive correlation: When two variables X and Y move in the same direction, the
correlation between the two is positive. If one variable increases, the other variable also
increases. Ex- sales revenue and the advertising expenditure.
Negative correlation: When two variables X and Y move in the opposite direction, the
correlation is negative. If one variable increases, the other decreases and vice versa. Ex-
quantity demanded and the price of the commodity.
Zero correlation: The correlation between two variables X and Y is zero when the
variables move with no connection to each other. If the variable X increases, Y may
increase in some situations and decrease in others.
Spearman’s Rank Order Correlation
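Suppose in a beauty contest two judges are asked to rank ten female participants. A rank
correlation coefficient between the ranks awarded by the two judges would show how
consistent they are in awarding the ranks. The Spearman’s rank correlation coefficient
(when there are no tied ranks) is given by
rs = 1 – [6Σd² / n(n² – 1)]
d = difference between the two ranks assigned to the same participant
n = number of participants ranked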
The rank correlation coefficient takes a value between –1 and +1. In case the value is +1, it
indicates a complete agreement between the ranks assigned by two judges, whereas the
value of –1 indicates a complete disagreement.
Example: Two judges in a beauty contest evaluate ten participants. A rank of one was assigned
to the most beautiful candidate, two to the next and so on. Compute the rank order correlation
and comment on the value.
It is seen that there is a high degree of positive rank correlation, which implies that
there is strong agreement between the two judges in their opinion about the beauty of the
contestants.
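The computation of the rank order correlation can be sketched as follows, using the standard no-ties formula 1 – 6Σd²/[n(n² – 1)], where d is the difference between the two ranks given to the same participant. The ranks below are made up for illustration and are not the ranks from the example in the text.

```python
def spearman_rank_correlation(ranks_a, ranks_b):
    """Spearman's rank correlation for two rankings of the same items.
    Assumes there are no tied ranks."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both judges must rank the same number of participants")
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Made-up ranks awarded by two hypothetical judges to ten participants
judge_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
judge_2 = [2, 1, 4, 3, 5, 7, 6, 8, 10, 9]
print(round(spearman_rank_correlation(judge_1, judge_2), 3))
```

For these made-up ranks Σd² = 8, so the coefficient is 1 – 48/990, a value close to +1. Identical rankings give exactly +1 and fully reversed rankings give exactly –1.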
Karl Pearson Linear Correlation
A quantitative estimate of a linear correlation between two variables X and Y is given by Karl Pearson as:
r = Σ(X – X̄)(Y – Ȳ) / √[Σ(X – X̄)² Σ(Y – Ȳ)²]
X̄, Ȳ = arithmetic means of X and Y
The linear correlation coefficient takes a value between –1 and +1 (both values inclusive).
If the value of the correlation coefficient is equal to 1, the two variables are perfectly positively correlated and the
scatter of the points of the variables X and Y will lie on a positively sloped straight line.
Similarly, if the correlation coefficient between the two variables X and Y is –1, the scatter of the points of these
variables will lie on a negatively sloped straight line and such a correlation will be called a perfectly negative
correlation.
It may be noted that the closer the scatter of points to the line, the higher is the degree of correlation between the
two variables.
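The linear correlation coefficient can be sketched in Python using the standard deviation-from-the-mean form, r = Σ(X – X̄)(Y – Ȳ) / √[Σ(X – X̄)² Σ(Y – Ȳ)²]. The data points are made up for illustration, and the function name is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson's linear correlation coefficient between paired observations."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must have the same number of observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up points lying exactly on straight lines
xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [3, 5, 7, 9, 11]))  # positively sloped line
print(pearson_r(xs, [10, 8, 6, 4, 2]))  # negatively sloped line
# Made-up points scattered around a positively sloped line
print(pearson_r(xs, [2, 5, 4, 8, 9]))
```

The first two calls correspond to the perfectly positive and perfectly negative cases. In the third, the points scatter around the line and the coefficient lies strictly between 0 and 1.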
Data Transformation
Under data transformation, the original data is changed to a new format for performing
data analysis so as to achieve the objectives of the study. This is generally done by the
researcher through creating new variables or by modifying the values of the scaled data.
At times it may become essential to collapse or combine adjacent categories of a variable
so as to reduce the number of categories of the original variable. For example, a 5-point
Likert scale with the categories strongly agree, agree, neither agree nor disagree, disagree
and strongly disagree can be clubbed into three categories. One can combine the strongly
agree and agree categories into one category. Similarly, the disagree and strongly disagree
responses could be clubbed into a separate category, and neither agree nor disagree could be
treated as a separate category. This is how a five-category scale can be collapsed into a
three-category one.
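Collapsing the five Likert categories into three can be sketched with a simple lookup table. The responses are made up for illustration, and the names COLLAPSE and collapse are assumptions.

```python
from collections import Counter

# Mapping from the original five categories to the three collapsed ones
COLLAPSE = {
    "strongly agree": "agree",
    "agree": "agree",
    "neither agree nor disagree": "neither agree nor disagree",
    "disagree": "disagree",
    "strongly disagree": "disagree",
}

def collapse(responses):
    """Recode five-category Likert responses into three categories."""
    return [COLLAPSE[response] for response in responses]

# Made-up responses from eight hypothetical respondents
responses = [
    "strongly agree", "agree", "agree", "neither agree nor disagree",
    "disagree", "strongly disagree", "strongly agree", "disagree",
]
print(Counter(responses))
print(Counter(collapse(responses)))
```

The recoded variable is stored as a new list, so the original five-category data remain available for any analysis that needs them.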