Descriptive Statistics and Inferential Statistics


DESCRIPTIVE STATISTICS

Descriptive statistics are used to describe the basic features of the
data in a study. They provide simple summaries about the sample and
the measures. Together with simple graphics analysis, they form the
basis of virtually every quantitative analysis of data.

Descriptive statistics are typically distinguished from inferential
statistics. With descriptive statistics you are simply describing what is
or what the data shows. With inferential statistics, you are trying to
reach conclusions that extend beyond the immediate data alone. For
instance, we use inferential statistics to try to infer from the sample
data what the population might think. Or, we use inferential statistics
to make judgments of the probability that an observed difference
between groups is a dependable one or one that might have
happened by chance in this study. Thus, we use inferential statistics
to make inferences from our data to more general conditions; we use
descriptive statistics simply to describe what's going on in our data.

Descriptive Statistics are used to present quantitative descriptions in a
manageable form. In a research study we may have lots of measures.
Or we may measure a large number of people on any measure.
Descriptive statistics help us to simplify large amounts of data in a
sensible way. Each descriptive statistic reduces lots of data into a
simpler summary. For instance, consider a simple number used to
summarize how well a batter is performing in baseball, the batting
average. This single number is simply the number of hits divided by
the number of times at bat (reported to three significant digits). A
batter who is hitting .333 is getting a hit one time in every three at
bats. One batting .250 is hitting one time in four. The single number
describes a large number of discrete events. Or, consider the scourge
of many students, the Grade Point Average (GPA). This single
number describes the general performance of a student across a
potentially wide range of course experiences.
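To make the arithmetic concrete, here is a minimal Python sketch of how these two summary numbers are computed; the hits, at-bats, grades and credit hours are invented for the example:

hits, at_bats = 50, 150
batting_average = round(hits / at_bats, 3)   # one hit in every three at bats
print(batting_average)                       # 0.333

# GPA: a credit-weighted average of grade points (hypothetical grades)
grades = [(4.0, 3), (3.0, 4), (2.0, 3)]      # (grade points, credit hours)
gpa = sum(g * c for g, c in grades) / sum(c for _, c in grades)
print(round(gpa, 2))                         # 3.0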
Every time you try to describe a large set of observations with a single
indicator you run the risk of distorting the original data or losing
important detail. The batting average doesn't tell you whether the
batter is hitting home runs or singles. It doesn't tell whether she's been
in a slump or on a streak. The GPA doesn't tell you whether the
student was in difficult courses or easy ones, or whether they were
courses in their major field or in other disciplines. Even given these
limitations, descriptive statistics provide a powerful summary that may
enable comparisons across people or other units.

UNIVARIATE ANALYSIS

Univariate analysis involves the examination across cases of one
variable at a time. There are three major characteristics of a single
variable that we tend to look at:

 the distribution

 the central tendency

 the dispersion

In most situations, we would describe all three of these characteristics
for each of the variables in our study.

The Distribution. The distribution is a summary of the frequency of
individual values or ranges of values for a variable. The simplest
distribution would list every value of a variable and the number of
persons who had each value. For instance, a typical way to describe
the distribution of college students is by year in college, listing the
number or percent of students at each of the four years. Or, we
describe gender by listing the number or percent of males and
females. In these cases, the variable has few enough values that we
can list each one and summarize how many sample cases had the
value. But what do we do for a variable like income or GPA? With
these variables there can be a large number of possible values, with
relatively few people having each one. In this case, we group the raw
scores into categories according to ranges of values. For instance, we
might look at GPA according to the letter grade ranges. Or, we might
group income into four or five ranges of income values.

Table 1. Frequency distribution table.

One of the most common ways to describe a single variable is with
a frequency distribution. Depending on the particular variable, all of
the data values may be represented, or you may group the values into
categories first (e.g., with age, price, or temperature variables, it would
usually not be sensible to determine the frequencies for each value.
Rather, the values are grouped into ranges and the frequencies
determined.). Frequency distributions can be depicted in two ways, as
a table or as a graph. Table 1 shows an age frequency distribution
with five categories of age ranges defined. The same frequency
distribution can be depicted in a graph as shown in Figure 1. This type
of graph is often referred to as a histogram or bar chart.
Figure 1. Frequency distribution bar chart.

Distributions may also be displayed using percentages. For example,
you could use percentages to describe the:

 percentage of people in different income levels

 percentage of people in different age ranges

 percentage of people in different ranges of standardized test scores
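As an illustration of how such a grouped frequency distribution is built, here is a short Python sketch; the ages and the five bin boundaries are invented for the example:

from collections import Counter

ages = [19, 22, 24, 31, 33, 35, 38, 41, 44, 47, 52, 56, 58, 63, 67]  # hypothetical
bins = [(18, 29), (30, 39), (40, 49), (50, 59), (60, 69)]

counts = Counter()
for age in ages:
    for low, high in bins:
        if low <= age <= high:
            counts[(low, high)] += 1
            break

n = len(ages)
for low, high in bins:
    c = counts[(low, high)]
    print(f"{low}-{high}: {c}  ({100 * c / n:.1f}%)")  # count and percentage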

MEASURE OF CENTRAL TENDENCY

A measure of central tendency is a summary statistic that represents the centre point
or typical value of a dataset. These measures indicate where most values in a
distribution fall and are also referred to as the central location of a distribution. You
can think of it as the tendency of data to cluster around a middle value. In statistics,
the three most common measures of central tendency are the mean, median,
and mode. Each of these measures calculates the location of the central point using
a different method.

Choosing the best measure of central tendency depends on the type of data you
have. In this post, I explore these measures of central tendency, show you how to
calculate them, and how to determine which one is best for your data.

Mean
The mean is the arithmetic average, and it is probably the measure of central
tendency that you are most familiar with. Calculating the mean is very simple. You just
add up all of the values and divide by the number of observations in your dataset.
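In Python, that definition translates directly; the five observations below are arbitrary:

values = [4, 8, 6, 5, 7]          # hypothetical observations
mean = sum(values) / len(values)  # add them all up, divide by the count
print(mean)                       # 6.0

The standard library's statistics.mean function does the same thing.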

The calculation of the mean incorporates all values in the data. If you change any
value, the mean changes. However, the mean doesn’t always locate the center of
the data accurately. Observe the histograms below where I display the mean in the
distributions.

In a symmetric distribution, the mean locates the center accurately.


However, in a skewed distribution, the mean can miss the mark. In the histogram
above, it is starting to fall outside the central area. This problem occurs
because outliers have a substantial impact on the mean. Extreme values in an
extended tail pull the mean away from the center. As the distribution becomes more
skewed, the mean is drawn further away from the center. Consequently, it’s best to
use the mean as a measure of the central tendency when you have a symmetric
distribution.

When to use the mean: Symmetric distribution, Continuous data

Related post: Using Histograms to Understand Your Data

Median
The median is the middle value. It is the value that splits the dataset in half. To find
the median, order your data from smallest to largest, and then find the data point that
has an equal amount of values above it and below it. The method for locating the
median varies slightly depending on whether your dataset has an even or odd
number of values. I’ll show you how to find the median for both cases. In the
examples below, I use whole numbers for simplicity, but you can have decimal
places.

In the dataset with the odd number of observations, notice how the number 12 has
six values above it and six below it. Therefore, 12 is the median of this dataset.

When there is an even number of values, you count in to the two innermost values
and then take the average. The average of 27 and 29 is 28. Consequently, 28 is the
median of this dataset.
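The following sketch implements both cases; the two datasets are hypothetical ones constructed so that the medians match the examples in the text (12 and 28):

def median(data):
    """Middle value of a sorted copy; average the two innermost when n is even."""
    s = sorted(data)
    mid = len(s) // 2
    if len(s) % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

odd_data = [3, 5, 7, 8, 9, 11, 12, 14, 16, 18, 20, 22, 24]
even_data = [21, 23, 25, 27, 29, 31, 33, 35]
print(median(odd_data))   # 12: six values below it, six above
print(median(even_data))  # 28.0: the average of the innermost pair, 27 and 29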
Outliers and skewed data have a smaller effect on the median. To understand why,
imagine we have the Median dataset below and find that the median is 46. However,
we discover data entry errors and need to change four values, which are shaded in
the Median Fixed dataset. We’ll make them all significantly higher so that we now
have a skewed distribution with large outliers.

As you can see, the median doesn’t change at all. It is still 46. Unlike the mean, the
median value doesn’t depend on all the values in the dataset. Consequently, when
some of the values are more extreme, the effect on the median is smaller. Of course,
with other types of changes, the median can change. When you have a skewed
distribution, the median is a better measure of central tendency than the mean.
Comparing the mean and median

Now, let’s test the median on the symmetrical and skewed distributions to see how it
performs, and I’ll include the mean on the histograms so we can make comparisons.

In a symmetric distribution, the mean and median both find the center accurately.
They are approximately equal.
In a skewed distribution, the outliers in the tail pull the mean away from the center
towards the longer tail. For this example, the mean and median differ by over 9000,
and the median better represents the central tendency for the distribution.

These data are based on the U.S. household income for 2006. Income is the classic
example of when to use the median because it tends to be skewed. The median
indicates that half of all incomes fall below 27581, and half are above it. For these
data, the mean overestimates where most household incomes fall.

When to use the median: Skewed distribution, Continuous data, Ordinal data

Mode
The mode is the value that occurs the most frequently in your data set. On a bar
chart, the mode is the highest bar. If the data have multiple values that are tied for
occurring the most frequently, you have a multimodal distribution. If no value
repeats, the data do not have a mode.

In the dataset below, the value 5 occurs most frequently, which makes it the mode.
These data might represent a 5-point Likert scale.
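A small Python sketch of finding the mode, using hypothetical Likert responses in which 5 is the most frequent value; it also handles the multimodal and no-mode cases described above:

from collections import Counter

responses = [3, 4, 5, 5, 2, 5, 4, 5, 1, 5, 4, 3]  # hypothetical Likert data
counts = Counter(responses)
top = max(counts.values())
modes = [value for value, c in counts.items() if c == top]

if top == 1:
    print("no mode: no value repeats")
elif len(modes) > 1:
    print("multimodal:", sorted(modes))
else:
    print("mode:", modes[0])  # mode: 5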
Typically, you use the mode with categorical, ordinal, and discrete data. In fact, the
mode is the only measure of central tendency that you can use with categorical data
—such as the most preferred flavor of ice cream. However, with categorical data,
there isn’t a central value because you can’t order the groups. With ordinal and
discrete data, the mode can be a value that is not in the center. Again, the mode
represents the most common value.

In the graph of service quality, Very Satisfied is the mode of this distribution because
it is the most common value in the data. Notice how it is at the extreme end of the
distribution. I’m sure the service providers are pleased with these results!

Finding the mode for continuous data


In the continuous data below, no values repeat, which means there is no mode. With
continuous data, it is unlikely that two or more values will be exactly equal because
there are an infinite number of values between any two values.

When you are working with the raw continuous data, don’t be surprised if there is no
mode. However, you can find the mode for continuous data by locating the maximum
value on a probability distribution plot. If you can identify a probability distribution that
fits your data, find the peak value and use it as the mode.

The probability distribution plot displays a lognormal distribution that has a mode of
16700. This distribution corresponds to the U.S. household income example in the
median section.
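One way to carry out that "find the peak" step is sketched below; the lognormal parameters are hypothetical stand-ins, not the fitted values behind the income figure quoted above. For a lognormal distribution the mode has a closed form, exp(mu − sigma²), and a numerical search over the probability curve confirms it:

import numpy as np
from scipy import stats

mu, sigma = 10.2, 0.75                       # hypothetical fitted parameters
mode = np.exp(mu - sigma ** 2)               # closed-form lognormal mode

dist = stats.lognorm(s=sigma, scale=np.exp(mu))
xs = np.linspace(1, 200_000, 400_000)
numeric_mode = xs[np.argmax(dist.pdf(xs))]   # peak of the probability curve
print(round(mode), round(numeric_mode))      # the two agree to grid precision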
When to use the mode: Categorical data, Ordinal data, Count data, Probability
Distributions

Which is Best—the Mean, Median, or Mode?


When you have a symmetrical distribution for continuous data, the mean, median,
and mode are equal. In this case, analysts tend to use the mean because it includes
all of the data in the calculations. However, if you have a skewed distribution, the
median is often the best measure of central tendency.

When you have ordinal data, the median or mode is usually the best choice. For
categorical data, you have to use the mode.

In cases where you are deciding between the mean and median as the better
measure of central tendency, you are also determining which types of
statistical hypothesis tests are appropriate for your data—if that is your ultimate goal.
I have written an article that discusses when to use parametric (mean) and
nonparametric (median) hypothesis tests, along with their advantages.

MEASURE OF DISPERSION

As the name suggests, the measure of dispersion shows the
scattering of the data. It tells the variation of the data from one
another and gives a clear idea about the distribution of the data. The
measure of dispersion shows the homogeneity or the heterogeneity of
the distribution of the observations.

Suppose you have four datasets of the same size and the mean is also the
same, say, m. In all the cases the sum of the observations will be the
same. Here, the measure of central tendency is not giving a clear and
complete idea about the distribution for the four given sets.

Can we get an idea about the distribution if we get to know about the
dispersion of the observations from one another within and between
the datasets? The main idea about the measure of dispersion is to get
to know how the data are spread. It shows how much the data vary
from their average value.

METHOD OF MEASURE OF DISPERSION
By now, you must have come across or learnt different measures of central
tendency. Measures of central tendency facilitate the representation of the entire
mass of the data with a single value. Can the central tendency describe the data
wholly and accurately? No, and that is precisely why we need measures of
dispersion. For instance, the hourly income of professionals in two offices are:

Office A : 30  50  50  65  70  90  100

Office B : 60  60  70  65  65  65  70

Here, evidently, the mean of both the offices is the same, that is, 65.

 In office A, the observations are much further away from the mean.
 In office B, almost all the observations are close to the mean. Certainly,
the two offices differ even though their mean is the same.
Therefore it is required to differentiate between the groups. We need some other
measure of the scattered-ness (or spread) of the observations. For this
purpose, we study this topic known as measures of dispersion.

In simple words, ‘dispersion’ is a lack of uniformity in the sizes or quantities of the
items of a group or series. According to Reiglemen, “Dispersion is the extent to
which the magnitudes or quantities of the items differ, the degree of diversity.” The
word may also be used to address the spread of the data.

Types of Dispersion
The measures of dispersion can be ‘absolute’ or ‘relative’. In the case of absolute
measures of dispersion, they are stated in the same units in which the original data
is expressed. For instance, if a group of data expresses the number of shoes a group
of people own, the absolute dispersion will provide the values in numbers.

Relative dispersion, on the other hand, is the ratio of a measure of absolute
dispersion to an appropriate average. The main benefit of this measure is that two
or more series can be compared with each other even if they are expressed in
different units.
Methods of Dispersion
Methods of studying dispersion are divided into two types :

(i) Mathematical Methods: We can study the ‘degree’ and ‘extent’ of variation with
the use of these methods. The measures of dispersion included in this category are:

(a) Range

(b) Quartile Deviation

(c) Average Deviation

(d) Standard deviation and coefficient of variation.

(ii) Graphic Methods: If only the extent of variation is studied, whether it is higher
or lower, a Lorenz curve is put to use.

Locating the Center of Your Data


Most articles that you’ll read about the mean, median, and mode focus on how you
calculate each one. I’m going to take a slightly different approach to start out. My
philosophy throughout my blog is to help you intuitively grasp statistics by focusing
on concepts. Consequently, I’m going to start by illustrating the central point of
several datasets graphically—so you understand the goal. Then, we’ll move on to
choosing the best measure of central tendency for your data and the calculations.

The three distributions below represent different data conditions. In each distribution,
look for the region where the most common values fall. Even though the shapes and
type of data are different, you can find that central location. That’s the area in the
distribution where the most common values are located.
As the graphs highlight, you can see where most values tend to occur. That’s the
concept. Measures of central tendency represent this idea with a value. Coming up,
you’ll learn that as the distribution and kind of data changes, so does the best
measure of central tendency. Consequently, you need to know the type of data you
have, and graph it, before choosing a measure of central tendency!

Related posts: Guide to Data Types and How to Graph Them

The central tendency of a distribution represents one characteristic of a distribution.
Another aspect is the variability around that central value. While measures of
variability are the topic of a different article (link below), this property describes how far
away the data points tend to fall from the center. The graph below shows how
distributions with the same central tendency (mean = 100) can actually be quite
different. The panel on the left displays a distribution that is tightly clustered around
the mean, while the distribution on the right is more spread out. It is crucial
to understand that the central tendency summarizes only one aspect of a
distribution and that it provides an incomplete picture by itself.
Related post: Measures of Variability: Range, Interquartile Range, Variance, and
Standard Deviation

Mathematical Methods

 Range
Two sections of 10 students each in class XII in a school were given a common
test in Economics (40 maximum marks). The scores of the students are given
below:

Section A:  6  9  11  13  15  21  23  28  29  35

Section B: 15 16 16 17  18  19  20  21  23  25

The average score in section A is 19. The average score in section B is 19.

In the above cited example, we observe that:

 the scores of all the students in section A are ranging from 6 to 35;
 the scores of the students in section B are ranging from 15 to 25.
The difference between the largest and the smallest scores in section A is 29 (35 − 6).
The difference between the largest and smallest scores in section B is 10 (25 − 15).

Thus, the difference between the largest and the smallest value of the data is termed
the range of the distribution. Range does not consider all the values of a series,
i.e. it takes only the extreme items; the middle items are not considered significant.
Therefore, range is not sufficient to explain the character of the distribution.
The concept of range is useful in the field of quality control and for studying the
variations in the prices of shares, etc.
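The range computation for the two sections above, as a quick Python check; the mean is printed alongside to show how it hides the difference in spread:

section_a = [6, 9, 11, 13, 15, 21, 23, 28, 29, 35]
section_b = [15, 16, 16, 17, 18, 19, 20, 21, 23, 25]

print(max(section_a) - min(section_a))  # 29
print(max(section_b) - min(section_b))  # 10
print(sum(section_a) / len(section_a),  # 19.0: same mean,
      sum(section_b) / len(section_b))  # 19.0  but very different spread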

 Quartile Deviation
The quartile deviation is a slightly better measure of absolute dispersion than the
range, although it ignores the observations on the ends (tails). It helps in knowing
the range within which a certain proportion of the items lie. It only considers the
values of the ‘Upper quartile’ (Q3) and the ‘Lower quartile’ (Q1).

Inter Quartile Range = Q3 – Q1 .

The Inter-Quartile Range is based upon the 50% of the values in a distribution
which lie in the middle, and hence is unaffected by extreme values. Half of the
Inter-Quartile Range is called the Quartile Deviation (Q.D.).

Thus: Q.D. = (Q3 – Q1)/2

Q.D. is therefore also called Semi Inter Quartile Range.

In individual and discrete series, Q1 is the size of the [(n + 1)/4]th value, but in a
continuous distribution, it is the size of the (n/4)th value. Similarly, for Q3 and the
median, n is used in place of n + 1.

A relative measure of dispersion based on the quartile deviation is called the
coefficient of quartile deviation. It is just a number without any units of
measurement. It can be used for comparing the dispersion of two or more sets of
data.
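A sketch of these quantities in Python, using the [(n + 1)/4]th-value rule quoted above for an individual series; the coefficient is computed with the standard formula (Q3 − Q1)/(Q3 + Q1), and Section B's scores serve as the data:

def quartile(sorted_data, q):
    # size of the q*(n+1)-th value, interpolating between neighbours
    pos = q * (len(sorted_data) + 1)   # 1-based position
    lower = int(pos)
    frac = pos - lower
    if lower >= len(sorted_data):
        return sorted_data[-1]
    return sorted_data[lower - 1] + frac * (sorted_data[lower] - sorted_data[lower - 1])

data = sorted([15, 16, 16, 17, 18, 19, 20, 21, 23, 25])  # Section B scores
q1, q3 = quartile(data, 0.25), quartile(data, 0.75)
print("IQR:", q3 - q1)                        # inter-quartile range
print("Q.D.:", (q3 - q1) / 2)                 # semi inter-quartile range
print("Coefficient of Q.D.:", (q3 - q1) / (q3 + q1))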

 Average Deviation
Average deviation can be defined as the arithmetic mean of the absolute deviations
(ignoring the negative signs) of various items from the mean, mode or median.

Calculation of mean deviation:

Individual Series: MD = Σ|D| / N

Discrete Series: MD = Σ(F × |D|) / N

Continuous Series: MD = Σ(F × |D|) / N, with deviations measured from the mid-points of the classes

Where,

MD = Mean deviation

|D| = Deviations from mean or median ignoring ± signs

N = Number of items (Individual Series)

N = Total number of frequencies (Discrete and Continuous Series)

F = Number of frequencies.
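A minimal Python version of the individual-series formula, run on the Office A incomes from earlier (whose mean and median are both 65):

def mean_deviation(values, about=None):
    # arithmetic mean of absolute deviations from a chosen centre (default: mean)
    centre = about if about is not None else sum(values) / len(values)
    return sum(abs(x - centre) for x in values) / len(values)

office_a = [30, 50, 50, 65, 70, 90, 100]
print(mean_deviation(office_a))            # about the mean: 18.57...
print(mean_deviation(office_a, about=65))  # about the median (also 65 here)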

 Standard Deviation
Standard deviation is one of the best and most widely used measures of dispersion.
Standard deviation is the square root of the arithmetic mean of the squares of the
deviations of items from their arithmetic mean. The concept of standard
deviation, which was introduced by Karl Pearson, is useful in assessing the
representativeness of the mean. It has practical significance because it does not
suffer from the problems associated with the range, quartile deviation or average
deviation.

The calculations are as under:

Individual Series:
1. Actual Mean Method: σ = √[ Σ(x − x̄)² / N ]

2. Assumed Mean Method: σ = √[ Σd²/N − (Σd/N)² ], where d = x − A for an assumed mean A

Discrete/Continuous Series:
1. Actual Mean Method: σ = √[ ΣF(x − x̄)² / N ]

2. Assumed Mean Method: σ = √[ ΣFd²/N − (ΣFd/N)² ]

3. Step Deviation Method: σ = √[ ΣFd′²/N − (ΣFd′/N)² ] × C, where d′ = (x − A)/C and C is a common factor
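The three individual-series methods are algebraically equivalent, as this Python sketch verifies on the Section A scores used earlier; the assumed mean A and the common factor C are arbitrary choices:

import math

data = [6, 9, 11, 13, 15, 21, 23, 28, 29, 35]   # Section A scores
n = len(data)

# Actual mean method
mean = sum(data) / n
sd_actual = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

# Assumed mean method, with an arbitrary assumed mean A
A = 20
d = [x - A for x in data]
sd_assumed = math.sqrt(sum(di ** 2 for di in d) / n - (sum(d) / n) ** 2)

# Step deviation method, with common factor C (the scores share no larger step)
C = 1
dp = [(x - A) / C for x in data]
sd_step = C * math.sqrt(sum(di ** 2 for di in dp) / n - (sum(dp) / n) ** 2)

print(sd_actual, sd_assumed, sd_step)  # all three agree: 9.176...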
Coefficient of variation:
This is the most apt measure when two or more groups of similar data are to be
compared with respect to stability (or uniformity or consistency or homogeneity). It
is the ratio of the standard deviation to the mean, expressed as a percentage:

C.V. = (σ / X̄) × 100

Where, C.V. = Coefficient of variation

σ = Standard deviation

X̄ = Arithmetic mean
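Applied to the two offices from the earlier example, the coefficient of variation makes the difference in stability obvious; a short sketch using the standard library:

import statistics

office_a = [30, 50, 50, 65, 70, 90, 100]
office_b = [60, 60, 70, 65, 65, 65, 70]

def cv(values):
    # population standard deviation as a percentage of the mean
    return statistics.pstdev(values) / statistics.mean(values) * 100

print(round(cv(office_a), 1))  # 34.6: incomes vary widely
print(round(cv(office_b), 1))  # 5.8: incomes cluster tightly around 65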

Graphical Methods
Lorenz Curve: A Lorenz Curve can be defined as a graph on which the
cumulative percentage of total national income (or some other variable) is plotted
against the cumulative percentage of the corresponding population (ranked in
increasing size of share). The extent to which the curve sags below a straight
diagonal line indicates the degree of inequality of distribution.
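The coordinates of a Lorenz curve can be computed directly from raw data, as in this sketch with invented incomes; plotting cum_income against cum_pop and comparing with the 45-degree diagonal gives the picture described above:

import numpy as np

incomes = np.array([12, 15, 18, 22, 30, 45, 60, 90, 150, 300], dtype=float)  # hypothetical

shares = np.sort(incomes)                                  # rank in increasing size of share
cum_pop = np.arange(1, len(shares) + 1) / len(shares)      # cumulative share of people
cum_income = np.cumsum(shares) / shares.sum()              # cumulative share of income

for p, i in zip(cum_pop, cum_income):
    print(f"{p:5.0%} of people hold {i:5.0%} of income")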

DESCRIPTIVE STATISTICS USING MS EXCEL

Perhaps the most common Data Analysis tool that you’ll use in
Excel is the one for calculating descriptive statistics. To see how
this works, take a look at this worksheet. It summarizes sales data
for a book publisher.

In column A, the worksheet shows the suggested retail price (SRP).
In column B, the worksheet shows the units sold of each book
through one popular bookselling outlet. You might choose to use
the Descriptive Statistics tool to summarize this data set.
To calculate descriptive statistics for the data set, follow these
steps:
1. Click the Data tab’s Data Analysis command button to tell
Excel that you want to calculate descriptive statistics.
Excel displays the Data Analysis dialog box.

2. In the Data Analysis dialog box, highlight the Descriptive Statistics
entry in the Analysis Tools list and then click OK.
Excel displays the Descriptive Statistics dialog box.
3. In the Input section of the Descriptive Statistics dialog box,
identify the data that you want to describe.
o To identify the data that you want to describe
statistically: Click the Input Range text box and then enter the
worksheet range reference for the data. In the case of the
example worksheet, the input range is $A$1:$C$38. Note that
Excel wants the range address to use absolute references —
hence, the dollar signs.

To make it easier to see or select the worksheet range, click
the worksheet button at the right end of the Input Range text
box. When Excel hides the Descriptive Statistics dialog box,
select the range that you want by dragging the mouse. Then
click the worksheet button again to redisplay the Descriptive
Statistics dialog box.
o To identify whether the data is arranged in columns or
rows: Select either the Columns or the Rows radio button.
o To indicate whether the first row holds labels that
describe the data: Select the Labels in First Row check box. In
the case of the example worksheet, the data is arranged in
columns, and the first row does hold labels, so you select the
Columns radio button and the Labels in First Row check box.
4. In the Output Options area of the Descriptive Statistics dialog
box, describe where and how Excel should produce the statistics.
o To indicate where the descriptive statistics that Excel
calculates should be placed: Choose from the three radio
buttons here — Output Range, New Worksheet Ply, and New
Workbook. Typically, you place the statistics onto a new
worksheet in the existing workbook. To do this, simply select
the New Worksheet Ply radio button.
o To identify what statistical measures you want
calculated: Use the Output Options check boxes. Select the
Summary Statistics check box to tell Excel to calculate
statistical measures such as mean, mode, and standard
deviation. Select the Confidence Level for Mean check box to
specify that you want a confidence level calculated for the
sample mean.
Note: If you calculate a confidence level for the sample mean,
you need to enter the confidence level percentage into the text
box provided. Use the Kth Largest and Kth Smallest check
boxes to indicate you want to find the largest or smallest value
in the data set.
After you describe where the data is and how the statistics
should be calculated, click OK. Here are the statistics that
Excel calculates.
Statistic                        Description

Mean                             Shows the arithmetic mean of the sample data.

Standard Error                   Shows the standard error of the data set (a measure of the
                                 difference between the predicted value and the actual value).

Median                           Shows the middle value in the data set (the value that
                                 separates the largest half of the values from the smallest
                                 half of the values).

Mode                             Shows the most common value in the data set.

Standard Deviation               Shows the sample standard deviation measure for the data set.

Sample Variance                  Shows the sample variance for the data set (the squared
                                 standard deviation).

Kurtosis                         Shows the kurtosis of the distribution.

Skewness                         Shows the skewness of the data set’s distribution.

Range                            Shows the difference between the largest and smallest values
                                 in the data set.

Minimum                          Shows the smallest value in the data set.

Maximum                          Shows the largest value in the data set.

Sum                              Adds all the values in the data set together to calculate the sum.

Count                            Counts the number of values in a data set.

Largest(X)                       Shows the largest X value in the data set.

Smallest(X)                      Shows the smallest X value in the data set.

Confidence Level(X) Percentage   Shows the confidence level at a given percentage for the
                                 data set values.

Here is a new worksheet with the descriptive statistics calculated.
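For readers without Excel, the same summary table can be reproduced in Python; the seven SRP values below are invented, and scipy's bias=False options are used because they correspond to Excel's sample-adjusted SKEW and KURT formulas:

import math
import statistics
from scipy import stats

data = [23.5, 19.0, 24.95, 14.0, 23.5, 19.95, 21.0]  # hypothetical SRP values
n = len(data)

print("Mean:", statistics.mean(data))
print("Standard Error:", statistics.stdev(data) / math.sqrt(n))
print("Median:", statistics.median(data))
print("Mode:", statistics.mode(data))
print("Standard Deviation:", statistics.stdev(data))  # sample sd, as in Excel
print("Sample Variance:", statistics.variance(data))
print("Kurtosis:", stats.kurtosis(data, bias=False))
print("Skewness:", stats.skew(data, bias=False))
print("Range:", max(data) - min(data))
print("Minimum:", min(data), "Maximum:", max(data))
print("Sum:", sum(data), "Count:", n)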

SAMPLING AND STATISTICAL INFERENCE

SAMPLING

Sampling is a process used in statistical analysis in which a predetermined
number of observations are taken from a larger population. The methodology
used to sample from a larger population depends on the type of analysis being
performed but may include simple random sampling or systematic sampling.

In business, a CPA performing an audit uses sampling to determine the
accuracy of account balances in the financial statements.

METHODS OF SAMPLING
It would normally be impractical to study a whole population, for example when doing a
questionnaire survey. Sampling is a method that allows researchers to infer information
about a population based on results from a subset of the population, without having to
investigate every individual. Reducing the number of individuals in a study reduces the
cost and workload, and may make it easier to obtain high quality information, but this
has to be balanced against having a large enough sample size with enough power to
detect a true association. (Calculation of sample size is addressed in section 1B
(statistics) of the Part A syllabus.)
If a sample is to be used, by whatever method it is chosen, it is important that the
individuals selected are representative of the whole population. This may involve
specifically targeting hard to reach groups. For example, if the electoral roll for a town
was used to identify participants, some people, such as the homeless, would not be
registered and therefore excluded from the study by default.

There are several different sampling techniques available, and they can be subdivided
into two groups: probability sampling and non-probability sampling. In probability
(random) sampling, you start with a complete sampling frame of all eligible individuals
from which you select your sample. In this way, all eligible individuals have a chance of
being chosen for the sample, and you will be more able to generalise the results from
your study. Probability sampling methods tend to be more time-consuming and
expensive than non-probability sampling. In non-probability (non-random) sampling, you
do not start with a complete sampling frame, so some individuals have no chance of
being selected. Consequently, you cannot estimate the effect of sampling error and there
is a significant risk of ending up with a non-representative sample which produces non-
generalisable results. However, non-probability sampling methods tend to be cheaper
and more convenient, and they are useful for exploratory research and hypothesis
generation.
 

Probability Sampling Methods

1. Simple random sampling

In this case each individual is chosen entirely by chance and each member of the
population has an equal chance, or probability, of being selected. One way of obtaining a
random sample is to give each individual in a population a number, and then use a table
of random numbers to decide which individuals to include.1 For example, if you have a
sampling frame of 1000 individuals, labelled 0 to 999, use groups of three digits from
the random number table to pick your sample. So, if the first three numbers from the
random number table were 094, select the individual labelled “94”, and so on.
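A sketch of the same idea in code, using Python's random module in place of a printed random number table; the frame of 1000 labelled individuals is hypothetical:

import random

frame = list(range(1000))          # individuals labelled 0 to 999
random.seed(1)                     # fixed seed only to make the sketch repeatable
sample = random.sample(frame, 50)  # each member has an equal chance of selection
print(sorted(sample)[:10])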

As with all probability sampling methods, simple random sampling allows the sampling
error to be calculated and reduces selection bias. A specific advantage is that it is the
most straightforward method of probability sampling. A disadvantage of simple random
sampling is that you may not select enough individuals with your characteristic of
interest, especially if that characteristic is uncommon. It may also be difficult to define a
complete sampling frame and inconvenient to contact them, especially if different forms
of contact are required (email, phone, post) and your sample units are scattered over a
wide geographical area.
 

2. Systematic sampling

Individuals are selected at regular intervals from the sampling frame. The intervals are
chosen to ensure an adequate sample size. If you need a sample size n from a
population of size x, you should select every x/nth individual for the sample.  For
example, if you wanted a sample size of 100 from a population of 1000, select every
1000/100 = 10th member of the sampling frame.
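The same rule in a short Python sketch, with a random starting point so that every member still has some chance of selection; the frame and sizes are hypothetical:

import random

def systematic_sample(frame, n):
    # pick every k-th member after a random start, where k = len(frame) // n
    k = len(frame) // n
    start = random.randrange(k)
    return frame[start::k][:n]

frame = list(range(1000))
sample = systematic_sample(frame, 100)  # every 10th member of the frame
print(len(sample))                      # 100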
Systematic sampling is often more convenient than simple random sampling, and it is
easy to administer. However, it may also lead to bias, for example if there are
underlying patterns in the order of the individuals in the sampling frame, such that the
sampling technique coincides with the periodicity of the underlying pattern. As a
hypothetical example, if a group of students were being sampled to gain their opinions
on college facilities, but the Student Record Department’s central list of all students was
arranged such that the sex of students alternated between male and female, choosing
an even interval (e.g. every 20th student) would result in a sample of all males or all
females. Whilst in this example the bias is obvious and should be easily corrected, this
may not always be the case.
 

3. Stratified sampling

In this method, the population is first divided into subgroups (or strata) who all share a
similar characteristic. It is used when we might reasonably expect the measurement of
interest to vary between the different subgroups, and we want to ensure representation
from all the subgroups. For example, in a study of stroke outcomes, we may stratify the
population by sex, to ensure equal representation of men and women. The study sample
is then obtained by taking equal sample sizes from each stratum. In stratified sampling,
it may also be appropriate to choose non-equal sample sizes from each stratum. For
example, in a study of the health outcomes of nursing staff in a county, if there are
three hospitals each with different numbers of nursing staff (hospital A has 500 nurses,
hospital B has 1000 and hospital C has 2000), then it would be appropriate to choose
the sample numbers from each hospital proportionally (e.g. 10 from hospital A, 20 from
hospital B and 40 from hospital C). This ensures a more realistic and accurate estimation
of the health outcomes of nurses across the county, whereas simple random sampling
would over-represent nurses from hospitals A and B. The fact that the sample was
stratified should be taken into account at the analysis stage.
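A sketch of proportional allocation for the three-hospital example; the member lists are stand-ins for real staff registers:

import random

strata = {"A": list(range(500)), "B": list(range(1000)), "C": list(range(2000))}
total = sum(len(members) for members in strata.values())
sample_size = 70

sample = {}
for name, members in strata.items():
    k = round(sample_size * len(members) / total)  # proportional allocation
    sample[name] = random.sample(members, k)

print({name: len(chosen) for name, chosen in sample.items()})  # {'A': 10, 'B': 20, 'C': 40}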
Stratified sampling improves the accuracy and representativeness of the results by
reducing sampling bias. However, it requires knowledge of the appropriate
characteristics of the sampling frame (the details of which are not always available), and
it can be difficult to decide which characteristic(s) to stratify by.
 

4. Clustered sampling

In a clustered sample, subgroups of the population are used as the sampling unit, rather
than individuals. The population is divided into subgroups, known as clusters, which are
randomly selected to be included in the study. Clusters are usually already defined, for
example individual GP practices or towns could be identified as clusters. In single-stage
cluster sampling, all members of the chosen clusters are then included in the study. In
two-stage cluster sampling, a selection of individuals from each cluster is then randomly
selected for inclusion. Clustering should be taken into account in the analysis. The
General Household survey, which is undertaken annually in England, is a good example
of a (one-stage) cluster sample. All members of the selected households (clusters) are
included in the survey.1
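A minimal two-stage cluster sampling sketch; the thirty 50-person practices are invented, and the cluster and within-cluster sample sizes are arbitrary:

import random

clusters = {f"practice_{i}": [f"p{i}_{j}" for j in range(50)] for i in range(30)}

chosen = random.sample(list(clusters), 5)  # stage 1: randomly select whole clusters
sample = [person
          for name in chosen
          for person in random.sample(clusters[name], 10)]  # stage 2: people within them
print(len(sample))  # 50 individuals drawn from 5 clusters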
Cluster sampling can be more efficient than simple random sampling, especially where a
study takes place over a wide geographical region. For instance, it is easier to contact
lots of individuals in a few GP practices than a few individuals in many different GP
practices. Disadvantages include an increased risk of bias, if the chosen clusters are not
representative of the population, resulting in an increased sampling error.
 

Non-Probability Sampling Methods

1. Convenience sampling

Convenience sampling is perhaps the easiest method of sampling, because participants
are selected based on availability and willingness to take part. Useful results can be
obtained, but the results are prone to significant bias, because those who volunteer to
take part may be different from those who choose not to (volunteer bias), and the
sample may not be representative of other characteristics, such as age or sex. Note:
volunteer bias is a risk of all non-probability sampling methods.
 

2. Quota sampling

This method of sampling is often used by market researchers. Interviewers are given a
quota of subjects of a specified type to attempt to recruit. For example, an interviewer
might be told to go out and select 20 adult men, 20 adult women, 10 teenage girls and
10 teenage boys so that they could interview them about their television viewing. Ideally
the quotas chosen would proportionally represent the characteristics of the underlying
population.

Whilst this has the advantage of being relatively straightforward and potentially
representative, the chosen sample may not be representative of other characteristics
that weren’t considered (a consequence of the non-random nature of sampling). 2
 

3. Judgement (or Purposive) Sampling

Also known as selective, or subjective, sampling, this technique relies on the judgement
of the researcher when choosing who to ask to participate. Researchers may thus
implicitly choose a “representative” sample to suit their needs, or specifically approach
individuals with certain characteristics. This approach is often used by the media when
canvassing the public for opinions and in qualitative research.

Judgement sampling has the advantage of being time- and cost-effective to perform
whilst resulting in a range of responses (particularly useful in qualitative research).
However, in addition to volunteer bias, it is also prone to errors of judgement by the
researcher and the findings, whilst being potentially broad, will not necessarily be
representative.
 

4. Snowball sampling

This method is commonly used in social sciences when investigating hard-to-reach
groups. Existing subjects are asked to nominate further subjects known to them, so the
sample increases in size like a rolling snowball. For example, when carrying out a survey
of risk behaviours amongst intravenous drug users, participants may be asked to
nominate other users to be interviewed.

Snowball sampling can be effective when a sampling frame is difficult to identify.
However, by selecting friends and acquaintances of subjects already investigated, there
is a significant risk of selection bias (choosing a large number of people with similar
characteristics or views to the initial individual identified).
 

Bias in sampling

There are five important potential sources of bias that should be considered when
selecting a sample, irrespective of the method used. Sampling bias may be introduced
when:1

1. Any pre-agreed sampling rules are deviated from
2. People in hard-to-reach groups are omitted
3. Selected individuals are replaced with others, for example if they are difficult to
contact
4. There are low response rates
5. An out-of-date list is used as the sample frame (for example, if it excludes people
who have recently moved to an area)

Advantages of sampling
Sampling ensures convenience, collection of intensive and
exhaustive data, suitability where resources are limited, and better rapport.
In addition, sampling has the following advantages.

1. Low cost of sampling


If data were to be collected for the entire population, the cost will be
quite high. A sample is a small proportion of a population. So, the
cost will be lower if data is collected for a sample of population
which is a big advantage.

2. Less time consuming in sampling


Use of sampling takes less time also. It consumes less time than
census technique. Tabulation, analysis etc., take much less time in
the case of a sample than in the case of a population.

3. Scope of sampling is high


The investigator is concerned with the generalization of data. To
study a whole population in order to arrive at generalizations would
be impractical.

Some populations are so large that their characteristics cannot be
measured. Before the measurement has been completed, the
population would have changed. But the process of sampling makes
it possible to arrive at generalizations by studying the variables
within a relatively small proportion of the population.
4. Accuracy of data is high
Having drawn a sample and computed the desired descriptive
statistics, it is possible to determine the stability of the obtained
sample value. A sample represents the population from which it is
drawn. It permits a high degree of accuracy due to a limited area of
operations. Moreover, careful execution of field work is possible.
Ultimately, the results of sampling studies turn out to be sufficiently
accurate.

5. Organization of convenience
Organizational problems involved in sampling are very few. Since
sample is of a small size, vast facilities are not required. Sampling is
therefore economical in respect of resources. Study of samples
involves less space and equipment.

6. Intensive and exhaustive data


In sample studies, measurements or observations are made of a
limited number of cases. So, intensive and exhaustive data can be collected.

7. Suitable in limited resources


The resources available within an organization may be limited.
Studying the entire universe is not viable. The population can be
satisfactorily covered through sampling. Where limited resources
exist, use of sampling is an appropriate strategy while conducting
marketing research.

8. Better rapport
An effective research study requires a good rapport between the
researcher and the respondents. When the population of the study
is large, the problem of rapport arises. But manageable samples
permit the researcher to establish adequate rapport with the
respondents.

Disadvantages of sampling
The reliability of the sample depends upon the appropriateness of
the sampling method used. The purpose of sampling theory is to
make sampling more efficient. But the real difficulties lie in
selection, estimation and administration of samples.

Disadvantages of sampling may be discussed under the following heads:

 Chances of bias
 Difficulties in selecting a truly representative sample
 Need for subject-specific knowledge
 Changeability of sampling units
 Impossibility of sampling
1. Chances of bias
The serious limitation of the sampling method is that it involves
biased selection and thereby leads us to draw erroneous
conclusions. Bias arises when the method of selection of sample
employed is faulty. Relatively small samples properly selected may be
much more reliable than large samples poorly selected.

2. Difficulties in selecting a truly representative sample

Sampling produces reliable and accurate results only when the sample
is representative of the whole group. Selection of a truly representative
sample is difficult when the phenomena under study are of a complex
nature. Selecting good samples is difficult.

3. Inadequate knowledge of the subject


Use of sampling method requires adequate subject specific
knowledge in sampling technique. Sampling involves statistical
analysis and calculation of probable error. When the researcher
lacks specialized knowledge in sampling, he may commit serious
mistakes. Consequently, the results of the study will be misleading.
4. Changeability of units
When the units of the population are not homogeneous, the
sampling technique will be unscientific. In sampling, though the
number of cases is small, it is not always easy to stick to the
selected cases. The units of sample may be widely dispersed.

Some of the cases of sample may not cooperate with the researcher
and some others may be inaccessible. Because of these problems, all
the cases may not be taken up. The selected cases may have to be
replaced by other cases. Changeability of units stands in the way of
the results of the study.

5. Impossibility of sampling
Deriving a representative sample is difficult when the universe is
too small or too heterogeneous. In this case, census study is the only
alternative. Moreover, in studies requiring a very high standard of
accuracy, the sampling method may be unsuitable. There will be
chances of errors even if samples are drawn most carefully.

SAMPLING VS NON-
SAMPLING ERROR
Sampling error is one which occurs due to unrepresentativeness of the
sample selected for observation. Conversely, non-sampling error is an
error that arises from human error, such as error in problem identification, method
or procedure used, etc.

An ideal research design seeks to control various types of error, but there are
some potential sources which may affect it. In sampling theory, total error can
be defined as the variation between the mean value of population parameter
and the observed mean value obtained in the research. The total error can be
classified into two categories, i.e. sampling error and non-sampling error.

In this article excerpt, you can find the important differences between
sampling and non-sampling error in detail.


Comparison Chart

Meaning
 Sampling error: a type of error that occurs because the sample selected does not
perfectly represent the population of interest.
 Non-sampling error: an error that occurs due to sources other than sampling while
conducting survey activities.

Cause
 Sampling error: deviation between the sample mean and the population mean.
 Non-sampling error: deficiency in the collection and analysis of data.

Type
 Sampling error: random.
 Non-sampling error: random or non-random.

Occurs
 Sampling error: only when a sample is selected.
 Non-sampling error: both in a sample and in a census.

Sample size
 Sampling error: the possibility of error reduces as the sample size increases.
 Non-sampling error: has nothing to do with the sample size.

Definition of Sampling Error

Sampling Error denotes a statistical error arising out of a certain sample
selected being unrepresentative of the population of interest. In simple terms,
it is an error which occurs when the sample selected does not contain the true
characteristics, qualities or figures of the whole population.

The main reason behind sampling error is that the sampler draws various
sampling units from the same population but the units may have individual
variances. Moreover, they can also arise out of defective sample design, faulty
demarcation of units, wrong choice of statistic, or substitution of a sampling unit
by the enumerator for their own convenience. Therefore, it is considered as
the deviation between the true mean value for the sample and that of the
population.

Definition of Non-Sampling Error

Non-Sampling Error is an umbrella term which comprises all the errors,
other than the sampling error. They arise due to a number of reasons, i.e. error
in problem definition, questionnaire design, approach, coverage, information
provided by respondents, data preparation, collection, tabulation, and
analysis.

There are two types of non-sampling error:

 Response Error: Error arising because inaccurate answers were given
by respondents, or their answers were misinterpreted or recorded wrongly.
It consists of researcher error, respondent error and interviewer error,
which are further classified as under.
o Researcher Error

 Surrogate Error
 Sampling Error
 Measurement Error
 Data Analysis Error
 Population Definition Error
o Respondent Error

 Inability Error
 Unwillingness Error
o Interviewer Error

 Questioning Error
 Recording Error
 Respondent Selection Error
 Cheating Error
 Non-Response Error: Error arising when some respondents
who are a part of the sample do not respond.

Key Differences Between Sampling and Non-Sampling Error
The significant differences between sampling and non-sampling error are
mentioned in the following points:

1. Sampling error is a statistical error that happens because the sample selected
does not perfectly represent the population of interest. Non-sampling
error occurs due to sources other than sampling while conducting
survey activities.
2. Sampling error arises because of the variation between the true mean
value for the sample and the population. On the other hand, the non-
sampling error arises because of deficiency and inappropriate analysis
of data.
3. Non-sampling error can be random or non-random whereas sampling
error occurs in the random sample only.
4. Sampling error arises only when the sample is taken as a representative of
a population, as opposed to non-sampling error, which arises both in
sampling and in complete enumeration.
5. Sampling error is mainly associated with the sample size, i.e. as the
sample size increases the possibility of error decreases. On the contrary,
the non-sampling error is not related to the sample size, so, with the
increase in sample size, it won’t be reduced.

Conclusion
To end this discussion, it is true to say that sampling error is one which is
completely related to the sampling design and can be reduced by expanding
the sample size. Conversely, non-sampling error is a basket that covers all the
errors other than the sampling error and so, it is unavoidable by nature, as it is
not possible to completely remove it.
