
Items                        Description of Module
Subject Name                 Management
Paper Name                   Research Methodology
Module Title                 UNIVARIATE DATA ANALYSIS
Module ID                    Module 11
Pre-Requisites               Understanding the
Objectives                   To study the
Keywords

Role                         Name                    Affiliation
Principal Investigator       Prof. Ipshita Bansal    Department of Management Studies, BPSMV, Khanpur Kalan, Sonipat
Co-Principal Investigator
Paper Coordinator            Prof. S.P. Singh        Department of Management Studies, GKV, Haridwar
Content Writer (CW)          Prof. S.P. Singh        Department of Management Studies, GKV, Haridwar
Content Reviewer (CR)
Language Editor (LE)

QUADRANT –I

1. Module : Univariate Analysis


2. Learning Outcome
3. Descriptive and Inferential Analysis
4. Types of Descriptive Analysis
5. Measures of Central Tendency
6. Measures of Spread
Summary

1. Module: Univariate Analysis


2. LEARNING OUTCOME:
After studying this module, you shall be able to:
• Know the basics of univariate analysis
• Understand descriptive and inferential analysis
• Comprehend the types of descriptive analysis
• Understand the measures of central tendency
• Become aware of the measures of spread
3. Introduction
Descriptive statistics describe a sample or a population, and they can form part of exploratory data analysis.
Once the researcher has collected the raw data from primary and secondary sources, the next
step is to analyze the data so as to draw logical inferences from them. In univariate analysis, one
variable is analyzed at a time. Data analysis can be of two types, namely descriptive and
inferential.

4. Descriptive and Inferential Analysis

[Figure 1 Types of analysis: descriptive analysis and inferential analysis]

4.1 Descriptive Analysis


Descriptive analysis involves transforming raw data into a form that is easy to
understand and interpret. It deals with summary measures computed from the sample
data. The usual ways of expressing data concisely are the average,
range, standard deviation, frequency distribution and percentage distribution. The first thing to do when
data analysis is taken up is to describe the sample. A set of representative questions
to be answered under descriptive statistics is:
• What is the average income of the sample?
• What is the average age of the sample?
• What is the standard deviation of ages in the sample?
• What percentage of the sample respondents are married?
• What is the median age of the sample respondents?
• Which income group has the highest number of users of the product in question in the sample?
• Is there an association between the frequency of purchase of the product and the income level of
the consumer?
• Is the level of job satisfaction related to the age of the employees?
• Which TV channel is viewed by the majority of viewers in the age group 20-30 years?
4.2 Types of descriptive analysis
The type of descriptive analysis to be carried out depends on the level at which the variable is
measured, which takes one of four forms: nominal, ordinal, interval and ratio. A frequency table together with a listing of
the mode(s) is adequate for nominal variables. For ordinal variables the median is an
appropriate measure of central tendency, and the range or one of its variants serves as a measure of
dispersion. For interval variables, the arithmetic mean and the standard deviation can be
applied. For ratio variables, the geometric and harmonic means are also appropriate as measures of
central tendency, and the coefficient of variation as a measure of dispersion. Skewness and
kurtosis are additional descriptors of the variable in the case of interval and ratio data.

4.3 Inferential analysis


In inferential statistics, conclusions about population parameters are drawn on the basis
of the sample results. The researcher attempts to generalize the sample results to the
population. The analysis is based on probability theory,
and a necessary condition for conducting inferential analysis is that the sample is
selected on a random basis. An illustrative list of questions covered by inferential statistics
is given below:
• Does the average age of the population differ significantly from 35?
• Is the average income of the population significantly greater than Rs. 25,000 per
month?
• Is the job satisfaction of unskilled workers significantly related to their pay
packet?
• Is there a significant difference between the users and non-users of a brand?
• Is the sales growth of the company significant?
• Is there a significant correlation between advertisement expenditure and the
disposable income of the individual?
• Are skilled workers significantly more satisfied than unskilled workers?
• Is there a significant difference between urban and rural households in terms of
mean monthly expenditure on food?
• Is there a significant difference between the average starting salaries of fresh MBAs
with marketing and finance specializations and those of others?

4.4 Descriptive Analysis : Univariate Data


Univariate procedures deal with the analysis of one variable at a time. The first step in
univariate analysis is the preparation of a frequency distribution for each variable. A
frequency distribution is a count of the responses or observations falling in each of the
categories or codes assigned to a variable. Consider a nominal scale variable, the gender
of respondents. Table 1 presents the raw frequencies and the percentages of responses for
each category of this variable:

Table 1

Gender    Frequency   Per cent   Valid per cent   Cumulative per cent
Male         401        77.11        77.11              77.11
Female       119        22.89        22.89             100.0
Total        520       100          100.0
This tabulation can be done by hand using tally marks. For a large
sample, however, the frequency distribution table is prepared using computer software. The results
indicate that out of a sample of 520 respondents, 401 are male and 119 are female. Raw
frequencies are often converted into percentages because percentages are more meaningful. In the
present example 77.11 per cent of the respondents are male and 22.89 per cent are female.
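The tabulation described above can be sketched in a few lines of Python. This is only an illustration: the function name frequency_table is hypothetical, and the raw responses are rebuilt from the counts reported in Table 1. Because the code rounds to two decimals, the last digit may differ slightly from the figures printed in the table.

```python
from collections import Counter


def frequency_table(responses):
    """Return (category, frequency, per cent) rows and the total count."""
    counts = Counter(responses)
    total = sum(counts.values())
    rows = []
    for category, freq in counts.most_common():
        rows.append((category, freq, round(100 * freq / total, 2)))
    return rows, total


# Raw responses rebuilt from Table 1: 401 male and 119 female respondents.
gender = ["Male"] * 401 + ["Female"] * 119
rows, total = frequency_table(gender)

for category, freq, pct in rows:
    print(f"{category:<8}{freq:>6}{pct:>8.2f}")
print(f"{'Total':<8}{total:>6}{100.0:>8.2f}")
```

A Counter plays the role of the tally marks here: each response adds one to the count of its category.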

4.5 Missing data


There are situations in which respondents, knowingly or unknowingly, leave certain questions
unanswered. The responses corresponding to such questions are treated as ‘missing
data’. The frequency distribution of the variable ‘marital status’ is presented in Table 2.
Table 2

Variable   Frequency   Per cent   Valid per cent   Cumulative per cent
Single        295        70.07        70.23              70.23
Married       125        29.69        29.77             100.0
Total         420        99.76       100.0
Missing         1         0.24
Total         421       100.0

In Table 2, the respondent who did not answer the question on marital status
was coded as nine and treated as missing data. The
missing value could as well be coded with another number. The only precaution to be
kept in mind is that the number assigned to a missing observation must not be
equal to any value that the variable can take in the survey. Had the values of the missing
observations been available, the research conclusions might have been different. The
extent to which the actual results deviate from the observed ones depends on the number
of missing observations and on how far the missing data would differ
from the observations actually recorded.
Table 2 shows that out of a sample of 421 respondents, 295 are single, 125 are married
and one observation is missing. In the ‘per cent’ column, 70.07 per cent are single, 29.69 per cent are married
and 0.24 per cent are missing observations. These percentages are computed on the total sample
of 421. Since one observation is missing, the actual sample for this variable
is 420. Therefore, a column named ‘valid per cent’ has been added, in which the
percentages are computed on a sample of 420. The ‘valid per cent’
column indicates that 70.23 per cent are single whereas 29.77 per cent are married. The
results in the two columns are almost the same because there was only one
missing value. Generally, if the volume of missing data is small, it is unlikely to affect
the conclusions of the analysis. This may not always be the case, and it is for this reason
that the ‘valid per cent’ column should be used for interpreting the results.
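The difference between the ‘per cent’ and ‘valid per cent’ columns can be sketched as follows. The codes 1 = single and 2 = married and the function name valid_percent_table are assumptions made for illustration; only the missing code 9 and the counts come from the text. The sketch also assumes that the cumulative per cent is accumulated over the valid per cent column. Results are not rounded the same way as in Table 2, so the last digit may differ.

```python
MISSING_CODE = 9  # code given to an unanswered question, as in the text


def valid_percent_table(codes, labels, missing_code=MISSING_CODE):
    """Per cent uses all respondents; valid per cent excludes missing codes."""
    n_total = len(codes)
    valid = [c for c in codes if c != missing_code]
    n_valid = len(valid)
    table = []
    cumulative = 0.0
    for code, label in labels.items():
        freq = valid.count(code)
        pct = 100 * freq / n_total          # base: whole sample
        valid_pct = 100 * freq / n_valid    # base: answered questions only
        cumulative += valid_pct
        table.append((label, freq, pct, valid_pct, cumulative))
    return table, n_valid, n_total - n_valid


# Table 2 rebuilt as raw codes; 1 = single and 2 = married are assumed codes.
codes = [1] * 295 + [2] * 125 + [MISSING_CODE]
table, n_valid, n_missing = valid_percent_table(codes, {1: "Single", 2: "Married"})

for label, freq, pct, valid_pct, cum in table:
    print(f"{label:<8}{freq:>5}{pct:>8.2f}{valid_pct:>8.2f}{cum:>8.2f}")
print(f"Valid n = {n_valid}, missing = {n_missing}")
```

With a single missing observation the two percentage columns differ only in the second decimal place; the gap widens as the share of missing codes grows.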
Table 3 presents the frequency distribution of the time of day at which respondents prefer to use a café. The
number of missing observations in this case is 50, amounting to 11.62 per cent of the
sample. As a consequence, the ‘per cent’ and ‘valid per cent’ figures differ,
especially for the ‘afternoon’, ‘evening’ and ‘night’ response categories.

Table 3

Time               Frequency   Per cent   Valid per cent   Cumulative per cent
Morning                20        4.65         5.26               5.26
Noon                   20        4.65         5.26              10.52
Afternoon              70       16.27        18.42              28.94
Evening               175       40.70        46.05              74.99
Night                  95       22.09        25.01             100.0
Total                 380       88.37       100
Missing (code 9)       50       11.62
Total                 430      100

For some variables the cumulative percentages are
very useful in interpreting the results. Table 4 presents the frequency distribution of the
monthly household income of 415 respondents. There are 11 missing observations in this
table. Therefore, the analysis should be based on a sample of 404 respondents, and
the ‘valid per cent’ column should be used for interpreting the results.
Table 4

Category (in '000)   Frequency   Per cent   Valid per cent   Cumulative per cent
Valid
  Less than 10           30        7.22         7.42               7.42
  10-19,999              79       19.03        19.55              26.97
  20-29,999             132       31.80        32.67              59.64
  30-39,999             120       28.91        29.70              89.34
  40-49,999              21        5.06         5.19              94.53
  50 and above           22        5.30         5.44             100.0
  Total                 404       97.32       100.0
Missing (code 9)         11        2.65
Total                   415      100.0

The results indicate that 19.55 per cent of the respondents have a monthly household
income of 10,000 to 19,999, whereas 5.44 per cent of the households have a monthly
income of 50,000 and above.

Analysis of multiple responses


At times, the researcher comes across multiple category questions in which respondents
can choose more than one answer. In such a case, the preparation of the frequency table
and its interpretation are slightly different. If the question in the research study is a multiple
category question and the respondents are allowed to tick more than one choice, the
percentages may not add up to 100. For example, consider the
following question: When accessing the internet at a cyber café, tick up to four frequently
used applications for which you use the cyber café.
1. E-mail
2. Chat
3. Browsing
4. Downloading
5. Shopping
6. Net telephone
7. Business and Commerce
8. Entertainment
9. Adult sites
10. Astrology and Horoscope
11. Education
12. Any other, please specify
The analysis shows that the most used application at a cyber café is e-mail: 94 per cent
of the users make use of it. The second most popular application is chatting, used by 76.3 per cent of
the sample respondents. The other applications, in order of preference,
are browsing (56 per cent), downloading (45 per cent), education (35.4 per cent),
entertainment (32.6 per cent) and so on.
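A minimal sketch of a multiple response tabulation is given below. The answers of the five respondents are hypothetical and are not the survey data quoted above; the function name multiple_response_table is likewise illustrative. Each percentage is computed on the number of respondents, not on the number of ticks, which is why the column can add up to more than 100.

```python
from collections import Counter


def multiple_response_table(selections):
    """Per cent of respondents choosing each option (base = respondents)."""
    n_respondents = len(selections)
    counts = Counter()
    for chosen in selections:
        counts.update(set(chosen))  # count an option once per respondent
    return {
        option: round(100 * freq / n_respondents, 1)
        for option, freq in counts.most_common()
    }


# Hypothetical answers from five respondents (not the survey in the text).
answers = [
    ["E-mail", "Chat", "Browsing"],
    ["E-mail", "Downloading"],
    ["E-mail", "Chat", "Education", "Entertainment"],
    ["Chat", "Browsing"],
    ["E-mail"],
]
percentages = multiple_response_table(answers)

for option, pct in percentages.items():
    print(f"{option:<15}{pct:>6.1f}")
print("Sum of percentages:", sum(percentages.values()))
```

In this made-up sample e-mail is ticked by four of the five respondents (80 per cent) and the percentages sum to well over 100.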

5. Measures of central tendency:


Measures of central tendency describe the central position of a frequency distribution for a group of
data. For example, a frequency distribution may show how the marks obtained by 100 students are
distributed from the lowest to the highest. Mean, median and mode are the three
measures of central tendency used in research.

[Figure 2 Measures of Central Tendency: mean, median, mode, skewness]

5.1 Mean
The mean, or average, of a variable is suitable for interval and ratio scale data. The mean is computed by
summing the values Xi of the variable and dividing the sum by the number of observations in the sample.
When interval or ratio scale data are grouped into categories or classes, the mean may be
computed by multiplying the midpoint of each class by the frequency of the class, summing these products and dividing by
the total frequency.
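The two computations can be sketched as below. The data are hypothetical, and for grouped data the sketch divides the sum of midpoint times frequency by the total frequency (the sum of the class frequencies), which is the standard definition of the grouped mean.

```python
def mean(values):
    """Arithmetic mean of ungrouped data: sum of X_i divided by n."""
    return sum(values) / len(values)


def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum of f_i * m_i divided by the total frequency."""
    total_frequency = sum(frequencies)
    weighted_sum = sum(f * m for f, m in zip(frequencies, midpoints))
    return weighted_sum / total_frequency


# Hypothetical ages of five respondents.
ages = [22, 25, 31, 40, 47]
print(mean(ages))  # 33.0

# Hypothetical classes 20-30, 30-40 and 40-50 with midpoints 25, 35 and 45.
print(grouped_mean([25, 35, 45], [10, 20, 10]))  # 35.0
```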

5.2 Median
The median can be computed for ratio, interval or ordinal scale data. The median is the
value in the distribution such that half of the observations are less than it and half are greater than it.
For ungrouped data, the median is the central value when the data are arranged in
increasing or decreasing order of magnitude. If the number of items in the sample is odd,
the value of the (n+1)/2th item gives the median. However, if there is an even number of items in the
sample, say of size 2n, the arithmetic mean of the nth and (n+1)th items gives the median. The data
need to be arranged in ascending or descending order of magnitude before the median is
computed.
For grouped data the median is computed by locating the median class and then
interpolating within it, on the assumption that all items are evenly spread over the class
interval.
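Both cases can be sketched as follows. The grouped version uses the usual linear interpolation, median = L + ((N/2 - F) / f) x C, where L is the lower limit of the median class, F the cumulative frequency before it, f its frequency, C the class width and N the total frequency. These symbols and the sample data are assumptions made for illustration, and equal class widths are assumed.

```python
def median(values):
    """Median of ungrouped data after sorting in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # the (n + 1) / 2-th item
    # Even number of items: mean of the two central items.
    return (ordered[middle - 1] + ordered[middle]) / 2


def grouped_median(lower_limits, frequencies, width):
    """Median of grouped data by interpolation within the median class."""
    total = sum(frequencies)
    cumulative = 0
    for lower, freq in zip(lower_limits, frequencies):
        if freq > 0 and cumulative + freq >= total / 2:
            return lower + (total / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no class with observations")


print(median([7, 3, 5]))      # 5
print(median([1, 2, 3, 4]))   # 2.5
# Hypothetical classes 20-30, 30-40 and 40-50 with frequencies 10, 20 and 10.
print(grouped_median([20, 30, 40], [10, 20, 10], 10))  # 35.0
```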

5.3 Mode
The mode is the measure of central tendency appropriate for nominal or higher order scales. It is
the point of maximum frequency in a distribution, around which the other
items of the set cluster closely. The mode can be computed for ordinal or interval data
once the data have been grouped. The concept is widely used in business. For example, a shoe store
owner would naturally be interested in knowing the shoe size that the majority of
customers ask for. Similarly, a garment manufacturer is interested in determining the
shirt size that fits most people so as to plan production accordingly.
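A short sketch of finding the mode for the shoe store example follows. The sizes sold are hypothetical, and the helper modes is an illustrative name. It returns every value that reaches the maximum frequency, since a distribution may have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


# Hypothetical shoe sizes asked for by ten customers.
shoe_sizes = [7, 8, 8, 9, 8, 10, 7, 8, 9, 8]
print(modes(shoe_sizes))                 # [8]

# A nominal variable with two modes.
print(modes(["A", "B", "A", "B", "C"]))  # ['A', 'B']
```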

5.4 Skewness

Skewness measures the lack of symmetry in a distribution. In a symmetrical distribution,
mean = median = mode. In a positively skewed distribution,
mean > median > mode.
In such a case, the longer tail of the distribution is towards the right, the mode falls under the peak
and the mean shifts because it is affected by extreme values. The same reasoning applies to a
negatively skewed distribution, where mean < median < mode. Skewness is measured by
the difference between the arithmetic mean and the mode. If the arithmetic mean is greater than
the mode, the skewness is positive, and if the mean is less than the mode, the skewness is negative.
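The sign rule above can be checked with a small sketch. The data sets and function names are hypothetical; the measure used is simply mean minus mode, as described in the text, and each data set has a single mode.

```python
from statistics import mean, median, mode


def skewness_by_mode(values):
    """Absolute skewness as described in the text: mean minus mode."""
    return mean(values) - mode(values)


def describe_skew(values):
    """Classify the direction of skewness from the sign of mean - mode."""
    difference = skewness_by_mode(values)
    if difference > 0:
        return "positively skewed"
    if difference < 0:
        return "negatively skewed"
    return "no skew by this measure"


# Hypothetical incomes with one very high value (long right tail).
incomes = [20, 20, 20, 25, 30, 35, 90]
print(mean(incomes) > median(incomes) > mode(incomes))  # True
print(describe_skew(incomes))                           # positively skewed

# Hypothetical test scores with one very low value (long left tail).
scores = [10, 65, 70, 75, 80, 80, 80]
print(mean(scores) < median(scores) < mode(scores))     # True
print(describe_skew(scores))                            # negatively skewed
```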

6. Measures of spread
Measures of spread summarize a group of data by indicating how spread out the scores are.
For example, the average score of 150 students may be 60 out
of 100. However, this does not mean that all students scored 60 marks. Instead, their scores are
spread out: some scores will be lower than the average and others higher. Measures of spread
summarize this dispersion of scores. Different statistics are employed for the purpose,
such as the range, quartiles, absolute deviation, standard deviation and variance.
[Figure 3 Measures of spread: range, quartile deviation, variance and standard deviation, coefficient of variation]

6.1 Range:

The range is the simplest measure of dispersion. It is defined as the difference between the
maximum value and the minimum value of the variable, that is, the distance between the end points of
the distribution when its values are arranged in order. The range can be computed for
interval and ratio scale data. The range, however, considers only the extreme values
and ignores all other data points, so its value can vary considerably from sample to
sample. Even with this limitation, the range is widely used as a measure of dispersion in
industrial quality control for the preparation of control charts.
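A small sketch with hypothetical scores shows both the computation and the sensitivity of the range to a single extreme value. The function name value_range is illustrative.

```python
def value_range(values):
    """Range: maximum value minus minimum value."""
    return max(values) - min(values)


# Hypothetical scores of seven students.
scores = [52, 55, 58, 60, 61, 63, 66]
print(value_range(scores))         # 14

# Adding one extreme score changes the range sharply.
print(value_range(scores + [98]))  # 46
```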

6.2 Quartile deviation


Quartile deviation is defined as half the difference between the third quartile and the first
quartile. The third quartile is the value of the variable (X) corresponding to
the first 75 per cent of the total frequency of grouped data. The first quartile is the value of
the variable (X) corresponding to the first 25 per cent of the total frequency of grouped data.
The quartile deviation can be computed as:

QD = (Q3 - Q1) / 2

where Q1 and Q3 are the first and third quartiles, respectively, such that

First quartile, Q1 = L1 + ((N/4 - F) / f1) x C

where N is the total frequency; L1 is the lower limit of the first quartile class; F is the cumulative frequency of the
class preceding the first quartile class; f1 is the frequency of
the first quartile class; and C is the width of the class interval. The third quartile Q3 is obtained
in the same way from the third quartile class, with 3N/4 in place of N/4.
The relative measure corresponding to the quartile deviation, known as the coefficient of
quartile deviation, is given by the following formula:

Coefficient of QD = (Q3 - Q1) / (Q3 + Q1)
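The quartile computations for grouped data can be sketched as follows. The sketch assumes the usual interpolation formula Q1 = L1 + ((N/4 - F) / f1) x C, with N the total frequency and 3N/4 used for Q3, and the usual relative measure (Q3 - Q1) / (Q3 + Q1). The class limits and frequencies are hypothetical, and equal class widths are assumed.

```python
def grouped_quantile(lower_limits, frequencies, width, fraction):
    """Value below which `fraction` of the total frequency lies."""
    total = sum(frequencies)
    target = fraction * total          # N/4 for Q1, 3N/4 for Q3
    cumulative = 0                     # F: cumulative frequency so far
    for lower, freq in zip(lower_limits, frequencies):
        if freq > 0 and cumulative + freq >= target:
            return lower + (target - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no class contains the requested position")


# Hypothetical classes 0-10, 10-20, 20-30 and 30-40 (N = 60, C = 10).
lower_limits = [0, 10, 20, 30]
frequencies = [10, 20, 20, 10]
width = 10

q1 = grouped_quantile(lower_limits, frequencies, width, 0.25)
q3 = grouped_quantile(lower_limits, frequencies, width, 0.75)
quartile_deviation = (q3 - q1) / 2
coefficient_of_qd = (q3 - q1) / (q3 + q1)

print(q1, q3)              # 12.5 27.5
print(quartile_deviation)  # 7.5
print(coefficient_of_qd)   # 0.375
```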

6.3 Variance and standard deviation


Variance is defined as the mean squared deviation of a variable from its arithmetic mean. The
positive square root of the variance is called the standard deviation. Because the variance is
expressed in squared units it is difficult to interpret, and therefore the standard deviation is used as the measure of dispersion. The
population standard deviation is denoted by σ.
If the standard deviation is computed from sample data, the following formula may be used:

s = sqrt( Σ (Xi - X̄)^2 / (n - 1) )

In the case of grouped data, the following formula for computing the sample standard deviation
may be used:

s = sqrt( Σ fi (Xi - X̄)^2 / (n - 1) )

where Xi is the midpoint of the ith class, fi is its frequency and n = Σ fi.

The standard deviation is a very useful measure, as it has a known relationship with the mean in the case of
a normal distribution. About 68 per cent of the observations lie within one standard
deviation of the mean; 95.5 per cent of the observations lie within two standard deviations; and
99.7 per cent of the observations lie within three standard deviations of the mean in the case of
a normal distribution. These properties are very useful in sampling, correlation, etc. Another
common application of the standard deviation is in testing the equality of two population
means.
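The definitions above can be sketched as below. The sketch assumes the usual convention that the population variance divides by the number of observations while the sample variance divides by n - 1; the data and function names are hypothetical.

```python
from math import sqrt


def population_variance(values):
    """Mean squared deviation from the arithmetic mean (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sample_variance(values):
    """Sample variance (divide by n - 1, an assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def grouped_sample_sd(midpoints, frequencies):
    """Sample standard deviation for grouped data using class midpoints."""
    n = sum(frequencies)
    mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    squared = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints))
    return sqrt(squared / (n - 1))


# Hypothetical scores with mean 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(scores))        # 4.0
print(sqrt(population_variance(scores)))  # 2.0
print(round(sqrt(sample_variance(scores)), 3))

# Hypothetical classes with midpoints 25, 35, 45 and frequencies 10, 20, 10.
print(round(grouped_sample_sd([25, 35, 45], [10, 20, 10]), 3))
```

The standard deviation is in the same units as the scores, which is why it is easier to interpret than the variance.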

6.4 Coefficient of variation


The coefficient of variation is computed for ratio scale measurements. It is a measure of
relative dispersion that can be used to compare the variability of two distributions, and
it is independent of the units of measurement. The formula for the
coefficient of variation is:

CV = (s / X̄) x 100

where s is the sample standard deviation and X̄ is the sample mean. The measure is
most useful when the two distributions being compared are entirely different and their units of
measurement are also different.
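A brief sketch with hypothetical data illustrates the comparison of two variables measured in different units. It uses the sample standard deviation from Python's statistics module as s; the function name coefficient_of_variation is illustrative.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = (sample standard deviation / mean) x 100."""
    return stdev(values) / mean(values) * 100


# Hypothetical heights (cm) and weights (kg) of five respondents.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(f"{cv_height:.2f}")  # 9.30
print(f"{cv_weight:.2f}")  # 22.59
```

Both variables have the same standard deviation in their own units, yet the weights vary more relative to their mean, which the coefficient of variation makes visible.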
Summary:
This module introduces how the researcher should carry out data analysis once the data from
primary and secondary sources have been collected. Data analysis can be univariate,
bivariate or multivariate depending on whether one variable, two variables or more than two
variables are analyzed at a time. The analysis of data can be descriptive or inferential in
nature. Descriptive analysis deals with describing the sample through summary measures
computed from the sample data. These include the average, frequency
distribution, range, standard deviation and percentage distribution. In inferential analysis,
the concern is to draw inferences about population parameters based on sample results. The module
focuses on the descriptive analysis of univariate data.
Under the descriptive analysis of univariate data, frequency distributions and
percentage distributions for nominal scale variables are discussed. The analysis is also explained for
multiple category and multiple response questions. The treatment of missing data is
addressed as well. The module also explains the analysis of ordinal scale data.
