DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS

DESCRIPTIVE STATISTICS

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.
UNIVARIATE ANALYSIS

Univariate analysis involves the examination across cases of one variable at a time. There are three major characteristics of a single variable that we tend to look at:

the distribution
the central tendency
the dispersion
MEASURE OF CENTRAL TENDENCY

A measure of central tendency is a summary statistic that represents the centre point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value. In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of these measures calculates the location of the central point using a different method.

Choosing the best measure of central tendency depends on the type of data you have. In this post, I explore these measures of central tendency, show you how to calculate them, and how to determine which one is best for your data.
The three distributions below represent different data conditions. In each distribution, look for the region where the most common values fall. Even though the shapes and types of data are different, you can find that central location. That's the area in the distribution where the most common values are located.

As the graphs highlight, you can see where most values tend to occur. That's the concept. Measures of central tendency represent this idea with a value. Coming up, you'll learn that as the distribution and kind of data changes, so does the best measure of central tendency. Consequently, you need to know the type of data you have, and graph it, before choosing a measure of central tendency!
Mean
The mean is the arithmetic average, and it is probably the measure of central
tendency with which you are most familiar. Calculating the mean is very simple. You just
add up all of the values and divide by the number of observations in your dataset.
The calculation of the mean incorporates all values in the data. If you change any
value, the mean changes. However, the mean doesn’t always locate the center of
the data accurately. Observe the histograms below where I display the mean in the
distributions.
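As a quick illustration, here is a minimal sketch of that calculation in Python; the values are made up for illustration.

```python
# Mean = sum of all values divided by the number of observations.
from statistics import mean

values = [23, 19, 31, 27, 25]  # hypothetical observations

manual_mean = sum(values) / len(values)
assert manual_mean == mean(values)  # the stdlib function gives the same result
print(manual_mean)  # 25.0
```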
Median
The median is the middle value. It is the value that splits the dataset in half. To find
the median, order your data from smallest to largest, and then find the data point that
has an equal amount of values above it and below it. The method for locating the
median varies slightly depending on whether your dataset has an even or odd
number of values. I’ll show you how to find the median for both cases. In the
examples below, I use whole numbers for simplicity, but you can have decimal
places.
In the dataset with the odd number of observations, notice how the number 12 has
six values above it and six below it. Therefore, 12 is the median of this dataset.
When there is an even number of values, you count in to the two innermost values
and then take the average. The average of 27 and 29 is 28. Consequently, 28 is the
median of this dataset.
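A short sketch of both cases in Python follows; the surrounding values are hypothetical, chosen so that they reproduce the medians described above (12 and 28).

```python
from statistics import median

odd_data = [3, 5, 7, 8, 9, 11, 12, 14, 16, 18, 20, 22, 24]  # 13 values
even_data = [21, 23, 25, 27, 29, 31, 33, 35]                # 8 values

print(median(odd_data))   # 12 -> the single middle value (six above, six below)
print(median(even_data))  # 28.0 -> average of the two innermost values, 27 and 29
```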
Outliers and skewed data have a smaller effect on the median. To understand why,
imagine we have the Median dataset below and find that the median is 46. However,
we discover data entry errors and need to change four values, which are shaded in
the Median Fixed dataset. We’ll make them all significantly higher so that we now
have a skewed distribution with large outliers.
As you can see, the median doesn’t change at all. It is still 46. Unlike the mean, the
median value doesn’t depend on all the values in the dataset. Consequently, when
some of the values are more extreme, the effect on the median is smaller. Of course,
with other types of changes, the median can change. When you have a skewed
distribution, the median is a better measure of central tendency than the mean.
Comparing the mean and median
Now, let’s test the median on the symmetrical and skewed distributions to see how it
performs, and I’ll include the mean on the histograms so we can make comparisons.
In a symmetric distribution, the mean and median both find the center accurately.
They are approximately equal.
In a skewed distribution, the outliers in the tail pull the mean away from the center
towards the longer tail. For this example, the mean and median differ by over 9000,
and the median better represents the central tendency for the distribution.
These data are based on the U.S. household income for 2006. Income is the classic
example of when to use the median because it tends to be skewed. The median
indicates that half of all incomes fall below 27581, and half are above it. For these
data, the mean overestimates where most household incomes fall.
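To make the comparison concrete, here is a small sketch with synthetic data (not the 2006 income figures) showing how a single large outlier pulls the mean toward the tail while barely moving the median.

```python
from statistics import mean, median

symmetric = [40, 45, 50, 55, 60]              # hypothetical symmetric data
skewed = [20, 25, 27, 30, 35, 40, 500]        # one large outlier in the right tail

print(mean(symmetric), median(symmetric))     # 50 50 -> roughly equal
print(round(mean(skewed)), median(skewed))    # 97 30 -> the outlier inflates the mean
```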
Mode
The mode is the value that occurs the most frequently in your data set. On a bar
chart, the mode is the highest bar. If the data have multiple values that are tied for
occurring the most frequently, you have a multimodal distribution. If no value
repeats, the data do not have a mode.
In the dataset below, the value 5 occurs most frequently, which makes it the mode.
These data might represent a 5-point Likert scale.
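A minimal sketch in Python, using hypothetical Likert responses in which 5 is the most frequent value:

```python
from statistics import mode, multimode

likert = [5, 4, 5, 3, 5, 2, 5, 4, 1, 5]  # hypothetical 5-point Likert responses

print(mode(likert))       # 5 -> the most frequently occurring value
print(multimode(likert))  # [5] -> would list every tied value in a multimodal dataset
```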
Typically, you use the mode with categorical, ordinal, and discrete data. In fact, the
mode is the only measure of central tendency that you can use with categorical data
—such as the most preferred flavor of ice cream. However, with categorical data,
there isn’t a central value because you can’t order the groups. With ordinal and
discrete data, the mode can be a value that is not in the center. Again, the mode
represents the most common value.
In the graph of service quality, Very Satisfied is the mode of this distribution because
it is the most common value in the data. Notice how it is at the extreme end of the
distribution. I’m sure the service providers are pleased with these results!
When you are working with raw continuous data, don't be surprised if there is no
mode. However, you can find the mode for continuous data by locating the maximum
value on a probability distribution plot. If you can identify a probability distribution that
fits your data, find the peak value and use it as the mode.
The probability distribution plot displays a lognormal distribution that has a mode of
16700. This distribution corresponds to the U.S. household income example in the
median section.
When to use the mode: Categorical data, Ordinal data, Count data, Probability
Distributions
When you have ordinal data, the median or mode is usually the best choice. For
categorical data, you have to use the mode.
In cases where you are deciding between the mean and median as the better measure of central tendency, you are also determining which types of statistical hypothesis tests are appropriate for your data, if that is your ultimate goal. I have written an article that discusses when to use parametric (mean) and nonparametric (median) hypothesis tests, along with their advantages.

MEASURE OF DISPERSION

As the name suggests, the measure of dispersion shows the scattering of the data. It tells the variation of the data from one another and gives a clear idea about the distribution of the data. The measure of dispersion shows the homogeneity or the heterogeneity of the distribution of the observations.
Suppose you have four datasets of the same size, and the mean is also the same, say m. In all the cases the sum of the observations will be the same. Here, the measure of central tendency does not give a clear and complete idea about the distribution of the four given sets. Can we get an idea about the distribution if we know how the observations are dispersed from one another, within and between the datasets? The main idea behind the measure of dispersion is to know how the data are spread. It shows how much the data vary from their average value.

METHODS OF MEASURING DISPERSION
By now, you must have come across or learnt different measures of central
tendency. Measures of central tendency facilitate the representation of the entire
mass of the data with a single value. Can the central tendency describe the data
wholly and accurately? No, and that is precisely why we need measures of
dispersion. For instance, the hourly income of professionals in two offices are:
Here, evidently, the mean of both the offices is the same, that is, 65.
In office A, the observations are much further away from the mean.
In office B, almost all the observations are close to the mean. Certainly, the two offices differ even though their mean is the same.
Therefore, it is necessary to differentiate between such groups. We need some other measure of the scatter (or spread) of the data. For this purpose, we study measures of dispersion.
Types of Dispersion
The measures of dispersion can be 'absolute' or 'relative'. Absolute measures of dispersion are stated in the same units in which the original data are expressed. For instance, if a group of data expresses the number of shoes a group of people own, the absolute dispersion will provide the values in numbers of shoes.

(i) Mathematical Methods: We can study the 'degree' and 'extent' of variation with the use of these methods. The measures of dispersion included in this category are:
(a) Range
(b) Quartile Deviation
(c) Average Deviation
(d) Standard Deviation

(ii) Graphic Methods: If only the extent of variation is studied, whether it is higher or lower, a Lorenz curve is put to use.
Mathematical Methods
Range
Two sections of 10 students each in class XII in a school were given a common
test in Economics (40 maximum marks). The scores of the students are given
below:
The average score in section A is 19. The average score in section B is 19.
the scores of all the students in section A are ranging from 6 to 35;
the scores of the students in section B are ranging from 15 to 25.
The difference between the largest and the smallest scores in section A is 29 (35-6)
The difference between the largest and smallest scores in section B is 10 (25-15)
Thus, the difference between the largest and the smallest value of a dataset is termed the range of the distribution. The range does not consider all the values of a series; it takes only the extreme items, and the middle items are not considered significant. Therefore, the range is not sufficient to describe the character of the distribution. The concept of range is useful in the field of quality control and in studying the variations in the prices of shares, etc.
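The individual scores are not reproduced here, so the sketch below uses hypothetical scores consistent with the stated summary (both means equal 19; the ranges are 29 and 10):

```python
from statistics import mean

section_a = [6, 9, 11, 13, 17, 20, 22, 26, 31, 35]   # hypothetical scores, mean 19
section_b = [15, 16, 17, 18, 19, 19, 20, 20, 21, 25]  # hypothetical scores, mean 19

for name, scores in [("A", section_a), ("B", section_b)]:
    rng = max(scores) - min(scores)   # range = largest value - smallest value
    print(name, mean(scores), rng)    # A 19 29 / B 19 10
```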
Quartile Deviation
The quartile deviation is a slightly better measure of absolute dispersion than the range, although it ignores the observations on the ends (tails). It helps in knowing the range within which a certain proportion of the items fall. It only considers the values of the 'Upper quartile' (Q3) and the 'Lower quartile' (Q1).

The Inter-Quartile Range is based upon the 50% of the values in a distribution which lie in the middle, and hence is unaffected by extreme values:

$IQR = Q_3 - Q_1$

Half of the Inter-Quartile Range is called the Quartile Deviation (Q.D.):

$Q.D. = \frac{Q_3 - Q_1}{2}$

In individual and discrete series, Q1 is the size of the [(n + 1)/4]th value, but in a continuous distribution, it is the size of the (n/4)th value. Similarly, for Q3 and the median, n is used in place of n + 1.
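As a rough sketch, Python's statistics.quantiles (3.8+) computes the quartile cut points; its default 'exclusive' method follows the (n + 1) convention mentioned above. The data values are made up.

```python
from statistics import quantiles

data = [4, 7, 9, 11, 12, 15, 18, 20, 25, 30, 41]  # hypothetical, already sorted

q1, q2, q3 = quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                       # inter-quartile range
qd = iqr / 2                        # quartile deviation (semi-inter-quartile range)
print(q1, q3, iqr, qd)              # 9.0 25.0 16.0 8.0
```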
Average Deviation
Average deviation (mean deviation) can be defined as the arithmetic mean of the absolute deviations (ignoring the negative signs) of the various items from the Mean, Mode or Median.

Discrete Series: $MD = \frac{\sum f\,|X - A|}{\sum f}$

Continuous Series (using the mid-points m of the classes): $MD = \frac{\sum f\,|m - A|}{\sum f}$

Where,
MD = Mean deviation
A = Mean, Median or Mode
f = the frequency of each item or class
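A minimal sketch of the discrete-series calculation about the mean, with hypothetical values and frequencies:

```python
values =      [2, 4, 6, 8]   # hypothetical item values X
frequencies = [1, 3, 4, 2]   # hypothetical frequencies f

n = sum(frequencies)
mean_x = sum(x * f for x, f in zip(values, frequencies)) / n        # weighted mean
md = sum(f * abs(x - mean_x) for x, f in zip(values, frequencies)) / n
print(mean_x, md)  # 5.4 1.52
```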
Standard Deviation
Standard deviation is one of the best and most widely used measures of dispersion. Standard deviation is the square root of the arithmetic mean of the squares of the deviations of the items from their arithmetic mean. The concept of standard deviation, which was introduced by Karl Pearson, is useful in assessing the representativeness of the mean. It has practical significance because it does not come with the problems associated with the range, quartile deviation or average deviation.

Individual Series:
1. Actual Mean Method: $\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$

Discrete/Continuous Series:
1. Actual Mean Method: $\sigma = \sqrt{\frac{\sum f\,(X - \bar{X})^2}{\sum f}}$
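A small sketch of the actual mean method for an individual series, checked against the standard library; the data are illustrative:

```python
from math import sqrt
from statistics import pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical individual series

mean_x = sum(data) / len(data)
sigma = sqrt(sum((x - mean_x) ** 2 for x in data) / len(data))  # population SD
assert sigma == pstdev(data)  # stdlib population standard deviation agrees
print(sigma)  # 2.0
```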
Coefficient of variation:
This is the most apt measure when two or more groups of similar data are to be compared with respect to stability (or uniformity, consistency or homogeneity). It is the ratio of the standard deviation to the mean, usually expressed as a percentage:

$CV = \frac{\sigma}{\bar{X}} \times 100$
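For instance, a short sketch comparing two hypothetical groups with the same mean (echoing the office A/B example above):

```python
from statistics import mean, pstdev

office_a = [45, 55, 65, 75, 85]  # hypothetical pay, spread widely around 65
office_b = [63, 64, 65, 66, 67]  # hypothetical pay, clustered tightly around 65

for name, pay in [("A", office_a), ("B", office_b)]:
    cv = pstdev(pay) / mean(pay) * 100
    print(name, round(cv, 1))  # A 21.8 / B 2.2 -> the lower CV is more consistent
```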
Graphical Methods
Lorenz Curve: A Lorenz Curve can be defined as a graph on which the cumulative percentage of total national income (or some other variable) is plotted against the cumulative percentage of the corresponding population (ranked in increasing size of share). The extent to which the curve sags below the straight diagonal line indicates the degree of inequality of the distribution.

DESCRIPTIVE STATISTICS USING MS EXCEL

Perhaps the most common Data Analysis tool that you'll use in Excel is the one for calculating descriptive statistics. To see how this works, take a look at this worksheet. It summarizes sales data for a book publisher.
In column A, the worksheet shows the suggested retail price (SRP).
In column B, the worksheet shows the units sold of each book
through one popular bookselling outlet. You might choose to use
the Descriptive Statistics tool to summarize this data set.
To calculate descriptive statistics for the data set, follow these
steps:
1. Click the Data tab’s Data Analysis command button to tell
Excel that you want to calculate descriptive statistics.
Excel displays the Data Analysis dialog box.
The summary output includes statistics such as:

Median: Shows the middle value in the data set (the value that separates the largest half of the values from the smallest half of the values).
Standard Deviation: Shows the sample standard deviation measure for the data set.
Sample Variance: Shows the sample variance for the data set (the square of the sample standard deviation).
Range: Shows the difference between the largest and smallest values in the data set.
Sum: Adds all the values in the data set together to calculate the sum.
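Outside Excel, the same summary measures can be reproduced with a short script; here is a rough Python equivalent with hypothetical unit-sales figures:

```python
from statistics import mean, median, stdev, variance

units_sold = [120, 95, 210, 150, 180, 75, 300]  # hypothetical units sold per book

print("Mean", mean(units_sold))
print("Median", median(units_sold))
print("Standard Deviation", stdev(units_sold))   # sample standard deviation
print("Sample Variance", variance(units_sold))   # square of the sample SD
print("Range", max(units_sold) - min(units_sold))
print("Sum", sum(units_sold))
```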
SAMPLING AND STATISTICAL INFERENCE

SAMPLING

Sampling is a process used in statistical analysis in which a predetermined number of observations are taken from a larger population. The methodology used to sample from a larger population depends on the type of analysis being performed, but may include simple random sampling or systematic sampling. In business, a CPA performing an audit uses sampling to determine the accuracy of account balances in the financial statements.

METHODS OF SAMPLING
It would normally be impractical to study a whole population, for example when doing a
questionnaire survey. Sampling is a method that allows researchers to infer information
about a population based on results from a subset of the population, without having to
investigate every individual. Reducing the number of individuals in a study reduces the
cost and workload, and may make it easier to obtain high quality information, but this
has to be balanced against having a large enough sample size with enough power to
detect a true association. (Calculation of sample size is addressed in section 1B
(statistics) of the Part A syllabus.)
If a sample is to be used, by whatever method it is chosen, it is important that the
individuals selected are representative of the whole population. This may involve
specifically targeting hard to reach groups. For example, if the electoral roll for a town
was used to identify participants, some people, such as the homeless, would not be
registered and therefore excluded from the study by default.
There are several different sampling techniques available, and they can be subdivided
into two groups: probability sampling and non-probability sampling. In probability
(random) sampling, you start with a complete sampling frame of all eligible individuals
from which you select your sample. In this way, all eligible individuals have a chance of
being chosen for the sample, and you will be more able to generalise the results from
your study. Probability sampling methods tend to be more time-consuming and
expensive than non-probability sampling. In non-probability (non-random) sampling, you
do not start with a complete sampling frame, so some individuals have no chance of
being selected. Consequently, you cannot estimate the effect of sampling error and there
is a significant risk of ending up with a non-representative sample which produces non-
generalisable results. However, non-probability sampling methods tend to be cheaper
and more convenient, and they are useful for exploratory research and hypothesis
generation.
Probability sampling methods

1. Simple random sampling
In this case each individual is chosen entirely by chance and each member of the
population has an equal chance, or probability, of being selected. One way of obtaining a
random sample is to give each individual in a population a number, and then use a table
of random numbers to decide which individuals to include.1 For example, if you have a
sampling frame of 1000 individuals, labelled 0 to 999, use groups of three digits from
the random number table to pick your sample. So, if the first three numbers from the
random number table were 094, select the individual labelled “94”, and so on.
As with all probability sampling methods, simple random sampling allows the sampling
error to be calculated and reduces selection bias. A specific advantage is that it is the
most straightforward method of probability sampling. A disadvantage of simple random
sampling is that you may not select enough individuals with your characteristic of
interest, especially if that characteristic is uncommon. It may also be difficult to define a
complete sampling frame and inconvenient to contact them, especially if different forms
of contact are required (email, phone, post) and your sample units are scattered over a
wide geographical area.
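As a rough illustration of the random-number-table idea, here is a sketch that draws a simple random sample from a frame of 1000 individuals labelled 0 to 999, as in the text; the sample size of 50 is arbitrary.

```python
import random

frame = list(range(1000))             # complete sampling frame, labels 0-999
sample = random.sample(frame, k=50)   # without replacement; each individual has
                                      # an equal chance of selection
print(sorted(sample)[:10])
```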
2. Systematic sampling
Individuals are selected at regular intervals from the sampling frame. The intervals are
chosen to ensure an adequate sample size. If you need a sample size n from a
population of size x, you should select every x/nth individual for the sample. For
example, if you wanted a sample size of 100 from a population of 1000, select every
1000/100 = 10th member of the sampling frame.
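A minimal sketch of that rule, matching the example of 100 from 1000 (every 10th member); the random starting point within the first interval is a common refinement, not required by the text:

```python
import random

frame = list(range(1000))           # sampling frame of size x = 1000
n = 100                              # required sample size
interval = len(frame) // n           # select every (x/n)th individual -> 10
start = random.randrange(interval)   # random start within the first interval
sample = frame[start::interval]
print(len(sample), sample[:5])       # 100 individuals at regular intervals
```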
Systematic sampling is often more convenient than simple random sampling, and it is
easy to administer. However, it may also lead to bias, for example if there are
underlying patterns in the order of the individuals in the sampling frame, such that the
sampling technique coincides with the periodicity of the underlying pattern. As a
hypothetical example, if a group of students were being sampled to gain their opinions
on college facilities, but the Student Record Department’s central list of all students was
arranged such that the sex of students alternated between male and female, choosing
an even interval (e.g. every 20th student) would result in a sample of all males or all
females. Whilst in this example the bias is obvious and should be easily corrected, this
may not always be the case.
3. Stratified sampling
In this method, the population is first divided into subgroups (or strata) who all share a
similar characteristic. It is used when we might reasonably expect the measurement of
interest to vary between the different subgroups, and we want to ensure representation
from all the subgroups. For example, in a study of stroke outcomes, we may stratify the
population by sex, to ensure equal representation of men and women. The study sample
is then obtained by taking equal sample sizes from each stratum. In stratified sampling,
it may also be appropriate to choose non-equal sample sizes from each stratum. For
example, in a study of the health outcomes of nursing staff in a county, if there are
three hospitals each with different numbers of nursing staff (hospital A has 500 nurses,
hospital B has 1000 and hospital C has 2000), then it would be appropriate to choose
the sample numbers from each hospital proportionally (e.g. 10 from hospital A, 20 from
hospital B and 40 from hospital C). This ensures a more realistic and accurate estimation
of the health outcomes of nurses across the county, whereas taking equal sample sizes
from each stratum would over-represent nurses from hospitals A and B. The fact that the sample was
stratified should be taken into account at the analysis stage.
Stratified sampling improves the accuracy and representativeness of the results by
reducing sampling bias. However, it requires knowledge of the appropriate
characteristics of the sampling frame (the details of which are not always available), and
it can be difficult to decide which characteristic(s) to stratify by.
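The proportional allocation in the nursing example can be computed directly; a small sketch (the total sample of 70 is implied by 10 + 20 + 40):

```python
# Sample sizes in the same ratio as the strata: 500 : 1000 : 2000 -> 10 : 20 : 40.
strata = {"hospital A": 500, "hospital B": 1000, "hospital C": 2000}
total_sample = 70

total_pop = sum(strata.values())
allocation = {h: round(total_sample * size / total_pop) for h, size in strata.items()}
print(allocation)  # {'hospital A': 10, 'hospital B': 20, 'hospital C': 40}
```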
4. Clustered sampling
In a clustered sample, subgroups of the population are used as the sampling unit, rather
than individuals. The population is divided into subgroups, known as clusters, which are
randomly selected to be included in the study. Clusters are usually already defined, for
example individual GP practices or towns could be identified as clusters. In single-stage
cluster sampling, all members of the chosen clusters are then included in the study. In
two-stage cluster sampling, a selection of individuals from each cluster is then randomly
selected for inclusion. Clustering should be taken into account in the analysis. The
General Household survey, which is undertaken annually in England, is a good example
of a (one-stage) cluster sample. All members of the selected households (clusters) are
included in the survey.1
Cluster sampling can be more efficient than simple random sampling, especially where a
study takes place over a wide geographical region. For instance, it is easier to contact
lots of individuals in a few GP practices than a few individuals in many different GP
practices. Disadvantages include an increased risk of bias, if the chosen clusters are not
representative of the population, resulting in an increased sampling error.
Non-probability sampling methods

1. Convenience sampling
2. Quota sampling
This method of sampling is often used by market researchers. Interviewers are given a
quota of subjects of a specified type to attempt to recruit. For example, an interviewer
might be told to go out and select 20 adult men, 20 adult women, 10 teenage girls and
10 teenage boys so that they could interview them about their television viewing. Ideally
the quotas chosen would proportionally represent the characteristics of the underlying
population.
Whilst this has the advantage of being relatively straightforward and potentially representative, the chosen sample may not be representative of other characteristics that weren't considered (a consequence of the non-random nature of sampling).2

3. Judgement (or purposive) sampling
Also known as selective, or subjective, sampling, this technique relies on the judgement of the researcher when choosing who to ask to participate. Researchers may thus implicitly choose a "representative" sample to suit their needs, or specifically approach
individuals with certain characteristics. This approach is often used by the media when
canvassing the public for opinions and in qualitative research.
4. Snowball sampling
Bias in sampling
There are five important potential sources of bias that should be considered when
selecting a sample, irrespective of the method used. Sampling bias may be introduced
when:1
Advantages of sampling
Sampling ensures convenience, the collection of intensive and exhaustive data, and suitability where resources are limited, and it fosters better rapport. In addition, sampling has the following advantages.
5. Organization of convenience
Organizational problems involved in sampling are very few. Since the sample is of a small size, vast facilities are not required. Sampling is therefore economical in respect of resources. The study of samples involves less space and equipment.
8. Better rapport
An effective research study requires a good rapport between the
researcher and the respondents. When the population of the study
is large, the problem of rapport arises. But manageable samples
permit the researcher to establish adequate rapport with the
respondents.
Disadvantages of sampling
The reliability of the sample depends upon the appropriateness of
the sampling method used. The purpose of sampling theory is to
make sampling more efficient. But the real difficulties lie in
selection, estimation and administration of samples.
Some of the cases in the sample may not cooperate with the researcher, and some others may be inaccessible. Because of these problems, all the cases may not be taken up. The selected cases may have to be replaced by other cases. This changeability of units stands in the way of the results of the study.
5. Impossibility of sampling
Deriving a representative sample is difficult when the universe is
too small or too heterogeneous. In this case, census study is the only
alternative. Moreover, in studies requiring a very high standard of
accuracy, the sampling method may be unsuitable. There will be
chances of errors even if samples are drawn most carefully.
SAMPLING VS NON-SAMPLING ERROR

Sampling error is one which occurs due to unrepresentativeness of the sample selected for observation. Conversely, a non-sampling error is an error that arises from human error, such as an error in problem identification, or in the method or procedure used, etc.
An ideal research design seeks to control various types of error, but there are
some potential sources which may affect it. In sampling theory, total error can
be defined as the variation between the mean value of population parameter
and the observed mean value obtained in the research. The total error can be
classified into two categories, i.e. sampling error and non-sampling error.
In this article excerpt, you can find the important differences between
sampling and non-sampling error in detail.
The main reason behind sampling error is that the sampler draws various sampling units from the same population, but the units may have individual variances. Moreover, sampling errors can also arise out of defective sample design, faulty demarcation of units, a wrong choice of statistic, or substitution of a sampling unit done by the enumerator for convenience. Therefore, sampling error is considered as the deviation between the true mean value of the sample and that of the population.
The types of errors include:

Surrogate Error
Sampling Error
Measurement Error
Data Analysis Error
Population Definition Error
Respondent Error
Inability Error
Unwillingness Error
Interviewer Error
Questioning Error
Recording Error
Respondent Selection Error
Cheating Error
Non-Response Error: Error arising when some respondents who are part of the sample do not respond.
Conclusion
To end this discussion, it is fair to say that sampling error is one which is completely related to the sampling design, and it can be reduced by expanding the sample size. Conversely, non-sampling error is a basket that covers all the errors other than the sampling error and so, it is unavoidable by nature, as it is not possible to remove it completely.