1499153291Module11Q1Univariateanalysis PDF
Co-Principal Investigator
QUADRANT –I
[Diagram: descriptive analysis and inferential analysis]
Table 1
If the marital status variable is examined in Table 2, the respondent who did not
answer the question on marital status was coded as nine and treated as missing data. The
missing value could just as well be coded with another number. The only precaution is
that a missing observation must be assigned a number that cannot occur as a valid value
of the variable in the survey. Had the value of the missing observation been available,
it might have led to different research conclusions. How far the observed results
deviate from the actual ones depends on the number of missing observations and on the
extent to which the missing data would differ from the observed data.
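The coding precaution above can be checked mechanically. The following is a minimal sketch, assuming hypothetical codes 1 (single) and 2 (married) for the valid answers and a short made-up list of responses; only the missing code 9 comes from the text.

```python
MISSING_CODE = 9  # missing code used in the text

# Hypothetical coding scheme for the valid answers (an assumption for illustration).
VALID_CODES = {1: "single", 2: "married"}

# The missing code must never coincide with a valid value of the variable.
if MISSING_CODE in VALID_CODES:
    raise ValueError("missing code collides with a valid category code")

# Hypothetical coded responses; one respondent did not answer.
responses = [1, 2, 1, 9, 1, 2]

valid_responses = [r for r in responses if r != MISSING_CODE]
n_missing = len(responses) - len(valid_responses)
print(len(valid_responses), n_missing)
```

Keeping the missing code outside the set of valid codes is what makes the filtering step unambiguous.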
Table 2 shows that out of a sample of 421 respondents, 295 are single, 125 are married
and one observation is missing. In the ‘per cent’ column, 70.07 per cent are single,
29.69 per cent are married and 0.24 per cent are missing observations. These percentages
are computed on the total sample of 421. Since one observation is missing, the actual
sample for this variable should be 420. Therefore, a column named ‘valid per cent’ has
been added, where the percentages are computed on a sample of 420. The ‘valid per cent’
column indicates that 70.23 per cent are single whereas 29.77 per cent are married. The
results in the two cases are almost the same because there was only one missing value.
Generally, if the volume of missing data is small, it is unlikely to affect the
conclusions from the analysis. This may not always be the case, and it is for this
reason that the ‘valid per cent’ column should be used for interpreting the results.
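The difference between the two percentage columns can be illustrated with a short sketch. It uses the marital status counts quoted above (295 single, 125 married, one missing); the function name `frequency_table` is a hypothetical helper, and the last decimal may differ from the printed table depending on how the table was rounded.

```python
def frequency_table(counts, missing):
    """Return {category: (frequency, per cent, valid per cent)}.

    'per cent' uses the full sample (valid + missing) as its base;
    'valid per cent' uses only the respondents who answered.
    """
    valid_n = sum(counts.values())
    total_n = valid_n + missing
    table = {}
    for category, freq in counts.items():
        percent = round(100 * freq / total_n, 2)
        valid_percent = round(100 * freq / valid_n, 2)
        table[category] = (freq, percent, valid_percent)
    return table


marital = frequency_table({"single": 295, "married": 125}, missing=1)
for category, row in marital.items():
    print(category, row)
```

With a single missing case the two bases (421 and 420) are nearly equal, so the two columns barely differ.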
Table 3 presents the frequency distribution of the time of day at which respondents
prefer to use the café. The number of missing observations in this case is 50, amounting
to 11.62 per cent of the sample. As a consequence, the ‘per cent’ and ‘valid per cent’
results differ, especially for the ‘afternoon’, ‘evening’ and ‘night’ response categories.
Table 3
Time               Frequency   Per cent   Valid per cent   Cumulative per cent
Morning                   20       4.65             5.26                  5.26
Noon                      20       4.65             5.26                 10.52
Afternoon                 70      16.27            18.42                 28.94
Evening                  175      38.37            46.05                 74.99
Night                     95      22.09            25.01                100.0
Total (valid)            380      86.03           100
Missing (code 9)          50      11.62
Total                    430     100
For some variables, the cumulative percentages may be very useful in interpreting the
results. Table 4 presents the frequency distribution of the monthly household income of
415 respondents. There are 11 missing observations in this table. Therefore, the
analysis should be carried out on a sample of 404 respondents. The ‘valid per cent’
column should be used for interpreting the results.
Table 4
Category (in '000)   Frequency   Per cent   Valid per cent   Cumulative per cent
Less than 10                30       7.22             7.42                  7.42
10-19,999                   79      19.03            19.55                 26.97
20-29,999                  132      31.80            32.67                 59.64
30-39,999                  120      28.91            29.70                 88.55
40-49,999                   21       5.06             5.19                 93.74
50 and above                22       5.30             5.44                100.0
Total (valid)              404      97.32           100.0
Missing (code 9)            11       2.65
Total                      415     100.0
The results indicate that 19.55 per cent of the respondents have a monthly household
income of 10,000 to 19,999, whereas 5.44 per cent of the households have a monthly
income of 50,000 and above.
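A cumulative per cent column is a running total of the valid frequencies expressed as a share of the valid sample. The sketch below recomputes it from the income frequencies of Table 4; the variable names are illustrative, and the figures are derived purely from the frequencies, so individual entries may not match every printed value in the table.

```python
from itertools import accumulate

# Valid frequencies from Table 4 (income categories in thousands).
frequencies = {
    "Less than 10": 30,
    "10-19,999": 79,
    "20-29,999": 132,
    "30-39,999": 120,
    "40-49,999": 21,
    "50 and above": 22,
}

valid_n = sum(frequencies.values())  # respondents who answered

# Running total of the frequencies, then expressed as a percentage of valid_n.
running_counts = list(accumulate(frequencies.values()))
cumulative_percent = [round(100 * count / valid_n, 2) for count in running_counts]

for category, cum in zip(frequencies, cumulative_percent):
    print(f"{category:<14} {cum:6.2f}")
```

Reading down the column answers questions such as "what share of households earn below 30,000", which single-category percentages do not show directly.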
Figure: Measures of central tendency (mean, median, mode) and skewness
5.1 Mean
The mean, or average, of a variable is suitable for interval and ratio scale data. It is computed
by summing the values Xi of the variable and dividing by the number of observations in the sample.
When interval or ratio scale data are grouped into categories or classes, the mean may be
computed by multiplying the midpoint of each class by the frequency of that class, summing these
products and dividing by the total number of observations.
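Both versions of the mean can be sketched in a few lines. The data below are hypothetical; for grouped data the sketch weights each class midpoint by its frequency and divides by the total frequency (the number of observations).

```python
def mean_ungrouped(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def mean_grouped(classes):
    """Mean of grouped data.

    classes: list of (lower_limit, upper_limit, frequency) tuples.
    Each class midpoint is weighted by its frequency, and the weighted
    sum is divided by the total frequency.
    """
    total_frequency = sum(freq for _, _, freq in classes)
    weighted_sum = sum((lower + upper) / 2 * freq for lower, upper, freq in classes)
    return weighted_sum / total_frequency


# Hypothetical data for illustration only.
raw_values = [4, 8, 6, 10, 12]
grouped = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]

print(mean_ungrouped(raw_values))
print(mean_grouped(grouped))
```

The grouped mean is an approximation, since it treats every observation in a class as if it sat at the class midpoint.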
5.2 Median
The median can be computed for ratio, interval or ordinal scale data. The median is the item of
the distribution such that half of the observations are less than it and half are greater than it.
For ungrouped data, the median is the central value when the data are arranged in increasing or
decreasing order of magnitude. If the number of items in the sample, n, is odd, the value of the
(n+1)/2th item gives the median. However, if there is an even number of items in the sample, say
of size 2n, the arithmetic mean of the nth and (n+1)th items gives the median. The data must be
arranged in ascending or descending order of magnitude before the median is computed.
For grouped data, the median is computed by locating the median class and then interpolating
within it, on the assumption that all items are evenly spread over the entire class interval.
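The two procedures can be sketched as follows. For grouped data the sketch assumes the usual interpolation form, median = L + ((N/2 − F) / f) × C, where L is the lower limit of the median class, F the cumulative frequency before it, f its frequency and C its width; these symbols and the sample data are assumptions made for illustration.

```python
def median_ungrouped(values):
    """Middle item of the ordered data (mean of the two middle items if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_grouped(classes):
    """Median of grouped data by linear interpolation within the median class.

    classes: list of (lower_limit, upper_limit, frequency), sorted by lower limit.
    Assumes observations are evenly spread over each class interval.
    """
    total = sum(freq for _, _, freq in classes)
    half = total / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= half:
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("no observations")


# Hypothetical data for illustration only.
print(median_ungrouped([7, 3, 9]))
print(median_ungrouped([7, 3, 9, 1]))
print(median_grouped([(0, 10, 2), (10, 20, 5), (20, 30, 3)]))
```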
5.3 Mode
The mode is the measure of central tendency appropriate for nominal or higher order scales. It is
the point of maximum frequency in a distribution, around which the other items of the set cluster
closely. For ordinal or interval data, the mode is appropriate once the data have been grouped.
The concept is widely used in business. For example, a shoe store owner would naturally be
interested in knowing the shoe size that most customers ask for. Similarly, a garment
manufacturer is interested in determining the shirt size that fits most people so as to plan
production accordingly.
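The shoe store example can be sketched with a simple frequency count. The sizes below are hypothetical, and the helper `modes` returns every value that shares the highest frequency, since a distribution can have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


# Hypothetical shoe sizes requested by customers.
shoe_sizes = [7, 8, 8, 9, 8, 10, 9, 8, 7]
print(modes(shoe_sizes))
```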
5.4 Skewness
6. Measures of spread
Measures of spread summarise a set of data by indicating how spread out the scores are. For
example, the average score of 150 students may be 60 out of 100. However, not all students will
have scored 60 marks; their scores will be spread out, with some below and others above the
average. Measures of spread summarise how widely the scores are dispersed. Different statistics
are employed for this purpose, such as the range, quartiles, absolute deviation, standard
deviation and variance.
Figure 3 Measures of spread (range, quartile deviation, variance and standard deviation, coefficient of variation)
6.1 Range:
The range is the simplest measure of dispersion. It is the difference between the maximum
value and the minimum value of the variable, that is, the distance between the end points of
the distribution when its values are arranged in order. The range can be computed for
interval and ratio scale data. However, it considers only the extreme values and ignores all
other data points, so its value can vary considerably from sample to sample. Even with this
limitation, the range is widely used as a measure of dispersion in industrial control for the
preparation of control charts.
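A short sketch shows both the computation and the sensitivity to extreme values. The marks are hypothetical.

```python
def value_range(values):
    """Difference between the maximum and the minimum value."""
    return max(values) - min(values)


# Hypothetical marks; adding a single extreme score changes the range sharply.
marks = [52, 60, 58, 61, 59]
marks_with_extreme = marks + [95]

print(value_range(marks))
print(value_range(marks_with_extreme))
```

Only the two end points enter the calculation, which is why one unusual observation can dominate the result.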
Quartile deviation = (Q3 − Q1) / 2
where Q1 and Q3 are the first and third quartiles, respectively, such that
First quartile, Q1 = L1 + ((N/4 − F) / f1) × C
where N is the total number of observations; L1 is the lower limit of the first quartile
class; F is the cumulative frequency of the class preceding the first quartile class; f1 is
the frequency of the first quartile class; and C is the width of the class interval.
The relative measure corresponding to the quartile deviation is given by the following
formula, which is known as the coefficient of quartile deviation.
δ = (Q3 − Q1) / (Q3 + Q1)
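The quartile calculations can be sketched for grouped data. The sketch assumes the standard textbook forms: a quartile is found by interpolation as L + ((kN/4 − F) / f) × C, the quartile deviation is (Q3 − Q1) / 2, and the coefficient of quartile deviation is (Q3 − Q1) / (Q3 + Q1). The function names and the class data are hypothetical.

```python
def grouped_quartile(classes, k):
    """k-th quartile (k = 1, 2 or 3) of grouped data by interpolation.

    classes: list of (lower_limit, upper_limit, frequency), sorted by lower limit.
    """
    total = sum(freq for _, _, freq in classes)
    target = k * total / 4
    cumulative = 0  # F: cumulative frequency before the quartile class
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= target:
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("no observations")


def quartile_deviation(q1, q3):
    """Half the distance between the first and third quartiles."""
    return (q3 - q1) / 2


def coefficient_of_quartile_deviation(q1, q3):
    """Relative (unit-free) version of the quartile deviation."""
    return (q3 - q1) / (q3 + q1)


# Hypothetical grouped data: (lower limit, upper limit, frequency).
classes = [(0, 10, 4), (10, 20, 8), (20, 30, 6), (30, 40, 2)]
q1 = grouped_quartile(classes, 1)
q3 = grouped_quartile(classes, 3)
print(q1, q3, quartile_deviation(q1, q3), coefficient_of_quartile_deviation(q1, q3))
```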
The standard deviation is a very useful measure, as it has a known relationship with the mean
in the case of a normal distribution. About 68 per cent of the observations lie within one
standard deviation of the mean, about 95.5 per cent lie within two standard deviations, and
about 99.7 per cent lie within three standard deviations of the mean. These properties are
very useful in sampling, correlation, etc. Another common application of the standard
deviation is in testing the equality of two population means.
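The normal-distribution percentages can be checked numerically, alongside a sample standard deviation. This is a minimal sketch using Python's standard `statistics` module; the scores are hypothetical, and `stdev` here is the sample version with n − 1 in the denominator.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical marks out of 100.
scores = [52, 60, 58, 61, 59, 70, 63, 57]

average = mean(scores)
s = stdev(scores)   # sample standard deviation (n - 1 in the denominator)
variance = s ** 2   # variance is the square of the standard deviation


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of its mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2))
```

These shares hold only when the data are approximately normally distributed; for skewed data they are a rough guide at best.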
CV = (s / X̄) × 100, where s is the standard deviation and X̄ is the mean of the sample.
The coefficient of variation is useful for comparing the variability of two distributions. It
is an especially useful measure when the two distributions are entirely different and their
units of measurement also differ.
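Because the coefficient of variation is a ratio, it does not depend on the unit of measurement. The sketch below illustrates this with hypothetical heights and weights, using the sample standard deviation; the helper name is an assumption.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100


# Hypothetical measurements in different units.
heights_cm = [160, 170, 175, 180, 165]
weights_kg = [55, 70, 82, 90, 60]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(round(cv_height, 2), round(cv_weight, 2))

# Changing the unit (centimetres to metres) leaves the CV unchanged.
heights_m = [h / 100 for h in heights_cm]
cv_height_m = coefficient_of_variation(heights_m)
```

In this made-up sample the weights are relatively more variable than the heights, a comparison that the raw standard deviations (in kg and cm) could not support.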
Summary:
This module introduces how the researcher should carry out data analysis once the data from
primary and secondary sources have been collected. The analysis may be univariate, bivariate
or multivariate, depending upon whether one variable, two variables or more than two variables
are analysed at a time. It may also be descriptive or inferential in nature. Descriptive
analysis deals with describing the sample through summary measures of the sample data, such as
the average, frequency distribution, range, standard deviation and percentage distribution. In
inferential analysis, the concern is to draw inferences about population parameters based on
sample results. The module focuses on the descriptive analysis of univariate data.
The descriptive analysis of univariate data covers frequency distributions and percentage
distributions for nominal scale variables. The analysis is also explained for multiple
category and multiple response category questions. The treatment of missing data is also
addressed. The module explains the analysis of ordinal scale data.