DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS

DESCRIPTIVE STATISTICS

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.
UNIVARIATE ANALYSIS

Univariate analysis involves the examination across cases of one variable at a time. There are three major characteristics of a single variable that we tend to look at:

the distribution
the central tendency
the dispersion
MEASURE OF CENTRAL TENDENCY

A measure of central tendency is a summary statistic that represents the centre point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value. In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of these measures calculates the location of the central point using a different method.

Choosing the best measure of central tendency depends on the type of data you have. In this post, I explore these measures of central tendency, show you how to calculate them, and how to determine which one is best for your data.
The three distributions below represent different data conditions. In each distribution, look for the region where the most common values fall. Even though the shapes and types of data are different, you can find that central location. That's the area in the distribution where the most common values are located.

As the graphs highlight, you can see where most values tend to occur. That's the concept. Measures of central tendency represent this idea with a value. Coming up, you'll learn that as the distribution and kind of data changes, so does the best measure of central tendency. Consequently, you need to know the type of data you have, and graph it, before choosing a measure of central tendency!
Mean
The mean is the arithmetic average, and it is probably the measure of central
tendency with which you are most familiar. Calculating the mean is very simple. You just
add up all of the values and divide by the number of observations in your dataset.
The calculation of the mean incorporates all values in the data. If you change any
value, the mean changes. However, the mean doesn’t always locate the center of
the data accurately. Observe the histograms below where I display the mean in the
distributions.
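As a quick illustration, here is a minimal sketch of that calculation in Python; the values are made up for illustration.

```python
# Mean = sum of all values divided by the number of observations.
from statistics import mean

values = [23, 19, 31, 27, 25]  # hypothetical observations

manual_mean = sum(values) / len(values)
assert manual_mean == mean(values)  # the stdlib function gives the same result
print(manual_mean)  # 25.0
```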
Median
The median is the middle value. It is the value that splits the dataset in half. To find
the median, order your data from smallest to largest, and then find the data point that
has an equal amount of values above it and below it. The method for locating the
median varies slightly depending on whether your dataset has an even or odd
number of values. I’ll show you how to find the median for both cases. In the
examples below, I use whole numbers for simplicity, but you can have decimal
places.
In the dataset with the odd number of observations, notice how the number 12 has
six values above it and six below it. Therefore, 12 is the median of this dataset.
When there is an even number of values, you count in to the two innermost values
and then take the average. The average of 27 and 29 is 28. Consequently, 28 is the
median of this dataset.
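A short sketch of both cases in Python follows; the surrounding values are hypothetical, chosen so that they reproduce the medians described above (12 and 28).

```python
from statistics import median

odd_data = [3, 5, 7, 8, 9, 11, 12, 14, 16, 18, 20, 22, 24]  # 13 values
even_data = [21, 23, 25, 27, 29, 31, 33, 35]                # 8 values

print(median(odd_data))   # 12 -> the single middle value (six above, six below)
print(median(even_data))  # 28.0 -> average of the two innermost values, 27 and 29
```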
Outliers and skewed data have a smaller effect on the median. To understand why,
imagine we have the Median dataset below and find that the median is 46. However,
we discover data entry errors and need to change four values, which are shaded in
the Median Fixed dataset. We’ll make them all significantly higher so that we now
have a skewed distribution with large outliers.
As you can see, the median doesn’t change at all. It is still 46. Unlike the mean, the
median value doesn’t depend on all the values in the dataset. Consequently, when
some of the values are more extreme, the effect on the median is smaller. Of course,
with other types of changes, the median can change. When you have a skewed
distribution, the median is a better measure of central tendency than the mean.
Comparing the mean and median
Now, let’s test the median on the symmetrical and skewed distributions to see how it
performs, and I’ll include the mean on the histograms so we can make comparisons.
In a symmetric distribution, the mean and median both find the center accurately.
They are approximately equal.
In a skewed distribution, the outliers in the tail pull the mean away from the center
towards the longer tail. For this example, the mean and median differ by over 9000,
and the median better represents the central tendency for the distribution.
These data are based on the U.S. household income for 2006. Income is the classic
example of when to use the median because it tends to be skewed. The median
indicates that half of all incomes fall below 27581, and half are above it. For these
data, the mean overestimates where most household incomes fall.
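To make the comparison concrete, here is a small sketch with synthetic data (not the 2006 income figures) showing how a single large outlier pulls the mean toward the tail while barely moving the median.

```python
from statistics import mean, median

symmetric = [40, 45, 50, 55, 60]              # hypothetical symmetric data
skewed = [20, 25, 27, 30, 35, 40, 500]        # one large outlier in the right tail

print(mean(symmetric), median(symmetric))     # 50 50 -> roughly equal
print(round(mean(skewed)), median(skewed))    # 97 30 -> the outlier inflates the mean
```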
Mode
The mode is the value that occurs the most frequently in your data set. On a bar
chart, the mode is the highest bar. If the data have multiple values that are tied for
occurring the most frequently, you have a multimodal distribution. If no value
repeats, the data do not have a mode.
In the dataset below, the value 5 occurs most frequently, which makes it the mode.
These data might represent a 5-point Likert scale.
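A minimal sketch in Python, using hypothetical Likert responses in which 5 is the most frequent value:

```python
from statistics import mode, multimode

likert = [5, 4, 5, 3, 5, 2, 5, 4, 1, 5]  # hypothetical 5-point Likert responses

print(mode(likert))       # 5 -> the most frequently occurring value
print(multimode(likert))  # [5] -> would list every tied value in a multimodal dataset
```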
Typically, you use the mode with categorical, ordinal, and discrete data. In fact, the
mode is the only measure of central tendency that you can use with categorical data
—such as the most preferred flavor of ice cream. However, with categorical data,
there isn’t a central value because you can’t order the groups. With ordinal and
discrete data, the mode can be a value that is not in the center. Again, the mode
represents the most common value.
In the graph of service quality, Very Satisfied is the mode of this distribution because
it is the most common value in the data. Notice how it is at the extreme end of the
distribution. I’m sure the service providers are pleased with these results!
When you are working with raw continuous data, don't be surprised if there is no
mode. However, you can find the mode for continuous data by locating the maximum
value on a probability distribution plot. If you can identify a probability distribution that
fits your data, find the peak value and use it as the mode.
The probability distribution plot displays a lognormal distribution that has a mode of
16700. This distribution corresponds to the U.S. household income example in the
median section.
When to use the mode: Categorical data, Ordinal data, Count data, Probability
Distributions
When you have ordinal data, the median or mode is usually the best choice. For
categorical data, you have to use the mode.
In cases where you are deciding between the mean and median as the better measure of central tendency, you are also determining which types of statistical hypothesis tests are appropriate for your data, if that is your ultimate goal. I have written an article that discusses when to use parametric (mean) and nonparametric (median) hypothesis tests, along with their advantages.

MEASURE OF DISPERSION

As the name suggests, the measure of dispersion shows the scattering of the data. It tells the variation of the data from one another and gives a clear idea about the distribution of the data. The measure of dispersion shows the homogeneity or the heterogeneity of the distribution of the observations.
Suppose you have four datasets of the same size, and the mean is also the same, say m. In all the cases the sum of the observations will be the same. Here, the measure of central tendency does not give a clear and complete idea about the distribution of the four given sets. Can we get an idea about the distribution if we know how the observations are dispersed from one another, within and between the datasets? The main idea behind the measure of dispersion is to know how the data are spread. It shows how much the data vary from their average value.

METHODS OF MEASURING DISPERSION
By now, you must have come across or learnt different measures of central
tendency. Measures of central tendency facilitate the representation of the entire
mass of the data with a single value. Can the central tendency describe the data
wholly and accurately? No, and that is precisely why we need measures of
dispersion. For instance, the hourly income of professionals in two offices are:
Here, evidently, the mean of both the offices is the same, that is, 65.
In office A, the observations are much further away from the mean.
In office B, almost all the observations are close to the mean. Certainly, the two offices differ even though their mean is the same.
Therefore, it is necessary to differentiate between such groups. We need some other measure of the scatter (or spread) of the data. For this purpose, we study measures of dispersion.
Types of Dispersion
The measures of dispersion can be 'absolute' or 'relative'. Absolute measures of dispersion are stated in the same units in which the original data are expressed. For instance, if a group of data expresses the number of shoes a group of people own, the absolute dispersion will provide the values in numbers of shoes.

(i) Mathematical Methods: We can study the 'degree' and 'extent' of variation with the use of these methods. The measures of dispersion included in this category are:
(a) Range
(b) Quartile Deviation
(c) Average Deviation
(d) Standard Deviation

(ii) Graphic Methods: If only the extent of variation is studied, whether it is higher or lower, a Lorenz curve is put to use.
Mathematical Methods
Range
Two sections of 10 students each in class XII in a school were given a common
test in Economics (40 maximum marks). The scores of the students are given
below:
The average score in section A is 19. The average score in section B is 19.
the scores of all the students in section A are ranging from 6 to 35;
the scores of the students in section B are ranging from 15 to 25.
The difference between the largest and the smallest scores in section A is 29 (35-6)
The difference between the largest and smallest scores in section B is 10 (25-15)
Thus, the difference between the largest and the smallest value of a dataset is termed the range of the distribution. The range does not consider all the values of a series; it takes only the extreme items, and the middle items are not considered significant. Therefore, the range is not sufficient to describe the character of the distribution. The concept of range is useful in the field of quality control and in studying the variations in the prices of shares, etc.
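The individual scores are not reproduced here, so the sketch below uses hypothetical scores consistent with the stated summary (both means equal 19; the ranges are 29 and 10):

```python
from statistics import mean

section_a = [6, 9, 11, 13, 17, 20, 22, 26, 31, 35]   # hypothetical scores, mean 19
section_b = [15, 16, 17, 18, 19, 19, 20, 20, 21, 25]  # hypothetical scores, mean 19

for name, scores in [("A", section_a), ("B", section_b)]:
    rng = max(scores) - min(scores)   # range = largest value - smallest value
    print(name, mean(scores), rng)    # A 19 29 / B 19 10
```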
Quartile Deviation
The quartile deviation is a slightly better measure of absolute dispersion than the range, although it ignores the observations on the ends (tails). It helps in knowing the range within which a certain proportion of the items fall. It only considers the values of the 'Upper quartile' (Q3) and the 'Lower quartile' (Q1).

The Inter-Quartile Range is based upon the 50% of the values in a distribution which lie in the middle, and hence is unaffected by extreme values:

$IQR = Q_3 - Q_1$

Half of the Inter-Quartile Range is called the Quartile Deviation (Q.D.):

$Q.D. = \frac{Q_3 - Q_1}{2}$

In individual and discrete series, Q1 is the size of the [(n + 1)/4]th value, but in a continuous distribution, it is the size of the (n/4)th value. Similarly, for Q3 and the median, n is used in place of n + 1.
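As a rough sketch, Python's statistics.quantiles (3.8+) computes the quartile cut points; its default 'exclusive' method follows the (n + 1) convention mentioned above. The data values are made up.

```python
from statistics import quantiles

data = [4, 7, 9, 11, 12, 15, 18, 20, 25, 30, 41]  # hypothetical, already sorted

q1, q2, q3 = quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                       # inter-quartile range
qd = iqr / 2                        # quartile deviation (semi-inter-quartile range)
print(q1, q3, iqr, qd)              # 9.0 25.0 16.0 8.0
```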
Average Deviation
Average deviation (mean deviation) can be defined as the arithmetic mean of the absolute deviations (ignoring the negative signs) of the various items from the Mean, Mode or Median.

Discrete Series: $MD = \frac{\sum f\,|X - A|}{\sum f}$

Continuous Series (using the mid-points m of the classes): $MD = \frac{\sum f\,|m - A|}{\sum f}$

Where,
MD = Mean deviation
A = Mean, Median or Mode
f = the frequency of each item or class
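A minimal sketch of the discrete-series calculation about the mean, with hypothetical values and frequencies:

```python
values =      [2, 4, 6, 8]   # hypothetical item values X
frequencies = [1, 3, 4, 2]   # hypothetical frequencies f

n = sum(frequencies)
mean_x = sum(x * f for x, f in zip(values, frequencies)) / n        # weighted mean
md = sum(f * abs(x - mean_x) for x, f in zip(values, frequencies)) / n
print(mean_x, md)  # 5.4 1.52
```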
Standard Deviation
Standard deviation is one of the best and most widely used measures of dispersion. Standard deviation is the square root of the arithmetic mean of the squares of the deviations of the items from their arithmetic mean. The concept of standard deviation, which was introduced by Karl Pearson, is useful in assessing the representativeness of the mean. It has practical significance because it does not come with the problems associated with the range, quartile deviation or average deviation.

Individual Series:
1. Actual Mean Method: $\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$

Discrete/Continuous Series:
1. Actual Mean Method: $\sigma = \sqrt{\frac{\sum f\,(X - \bar{X})^2}{\sum f}}$
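A small sketch of the actual mean method for an individual series, checked against the standard library; the data are illustrative:

```python
from math import sqrt
from statistics import pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical individual series

mean_x = sum(data) / len(data)
sigma = sqrt(sum((x - mean_x) ** 2 for x in data) / len(data))  # population SD
assert sigma == pstdev(data)  # stdlib population standard deviation agrees
print(sigma)  # 2.0
```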
Coefficient of variation:
This is the most apt measure when two or more groups of similar data are to be compared with respect to stability (or uniformity, consistency or homogeneity). It is the ratio of the standard deviation to the mean, usually expressed as a percentage:

$CV = \frac{\sigma}{\bar{X}} \times 100$
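For instance, a short sketch comparing two hypothetical groups with the same mean (echoing the office A/B example above):

```python
from statistics import mean, pstdev

office_a = [45, 55, 65, 75, 85]  # hypothetical pay, spread widely around 65
office_b = [63, 64, 65, 66, 67]  # hypothetical pay, clustered tightly around 65

for name, pay in [("A", office_a), ("B", office_b)]:
    cv = pstdev(pay) / mean(pay) * 100
    print(name, round(cv, 1))  # A 21.8 / B 2.2 -> the lower CV is more consistent
```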
Graphical Methods
Lorenz Curve: A Lorenz Curve can be defined as a graph on which the cumulative percentage of total national income (or some other variable) is plotted against the cumulative percentage of the corresponding population (ranked in increasing size of share). The extent to which the curve sags below the straight diagonal line indicates the degree of inequality of the distribution.

DESCRIPTIVE STATISTICS USING MS EXCEL

Perhaps the most common Data Analysis tool that you'll use in Excel is the one for calculating descriptive statistics. To see how this works, take a look at this worksheet. It summarizes sales data for a book publisher.
In column A, the worksheet shows the suggested retail price (SRP).
In column B, the worksheet shows the units sold of each book
through one popular bookselling outlet. You might choose to use
the Descriptive Statistics tool to summarize this data set.
To calculate descriptive statistics for the data set, follow these
steps:
1. Click the Data tab’s Data Analysis command button to tell
Excel that you want to calculate descriptive statistics.
Excel displays the Data Analysis dialog box.
The summary output includes statistics such as:

Median: Shows the middle value in the data set (the value that separates the largest half of the values from the smallest half of the values).
Standard Deviation: Shows the sample standard deviation measure for the data set.
Sample Variance: Shows the sample variance for the data set (the square of the sample standard deviation).
Range: Shows the difference between the largest and smallest values in the data set.
Sum: Adds all the values in the data set together to calculate the sum.
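Outside Excel, the same summary measures can be reproduced with a short script; here is a rough Python equivalent with hypothetical unit-sales figures:

```python
from statistics import mean, median, stdev, variance

units_sold = [120, 95, 210, 150, 180, 75, 300]  # hypothetical units sold per book

print("Mean", mean(units_sold))
print("Median", median(units_sold))
print("Standard Deviation", stdev(units_sold))   # sample standard deviation
print("Sample Variance", variance(units_sold))   # square of the sample SD
print("Range", max(units_sold) - min(units_sold))
print("Sum", sum(units_sold))
```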
SAMPLING AND STATISTICAL INFERENCE

SAMPLING

Sampling is a process used in statistical analysis in which a predetermined number of observations are taken from a larger population. The methodology used to sample from a larger population depends on the type of analysis being performed, but may include simple random sampling or systematic sampling. In business, a CPA performing an audit uses sampling to determine the accuracy of account balances in the financial statements.

METHODS OF SAMPLING
It would normally be impractical to study a whole population, for example when doing a
questionnaire survey. Sampling is a method that allows researchers to infer information
about a population based on results from a subset of the population, without having to
investigate every individual. Reducing the number of individuals in a study reduces the
cost and workload, and may make it easier to obtain high quality information, but this
has to be balanced against having a large enough sample size with enough power to
detect a true association. (Calculation of sample size is addressed in section 1B
(statistics) of the Part A syllabus.)
If a sample is to be used, by whatever method it is chosen, it is important that the
individuals selected are representative of the whole population. This may involve
specifically targeting hard to reach groups. For example, if the electoral roll for a town
was used to identify participants, some people, such as the homeless, would not be
registered and therefore excluded from the study by default.
There are several different sampling techniques available, and they can be subdivided
into two groups: probability sampling and non-probability sampling. In probability
(random) sampling, you start with a complete sampling frame of all eligible individuals
from which you select your sample. In this way, all eligible individuals have a chance of
being chosen for the sample, and you will be more able to generalise the results from
your study. Probability sampling methods tend to be more time-consuming and
expensive than non-probability sampling. In non-probability (non-random) sampling, you
do not start with a complete sampling frame, so some individuals have no chance of
being selected. Consequently, you cannot estimate the effect of sampling error and there
is a significant risk of ending up with a non-representative sample which produces non-
generalisable results. However, non-probability sampling methods tend to be cheaper
and more convenient, and they are useful for exploratory research and hypothesis
generation.
Probability sampling methods

1. Simple random sampling
In this case each individual is chosen entirely by chance and each member of the
population has an equal chance, or probability, of being selected. One way of obtaining a
random sample is to give each individual in a population a number, and then use a table
of random numbers to decide which individuals to include.1 For example, if you have a
sampling frame of 1000 individuals, labelled 0 to 999, use groups of three digits from
the random number table to pick your sample. So, if the first three numbers from the
random number table were 094, select the individual labelled “94”, and so on.
As with all probability sampling methods, simple random sampling allows the sampling
error to be calculated and reduces selection bias. A specific advantage is that it is the
most straightforward method of probability sampling. A disadvantage of simple random
sampling is that you may not select enough individuals with your characteristic of
interest, especially if that characteristic is uncommon. It may also be difficult to define a
complete sampling frame and inconvenient to contact them, especially if different forms
of contact are required (email, phone, post) and your sample units are scattered over a
wide geographical area.
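As a rough illustration of the random-number-table idea, here is a sketch that draws a simple random sample from a frame of 1000 individuals labelled 0 to 999, as in the text; the sample size of 50 is arbitrary.

```python
import random

frame = list(range(1000))             # complete sampling frame, labels 0-999
sample = random.sample(frame, k=50)   # without replacement; each individual has
                                      # an equal chance of selection
print(sorted(sample)[:10])
```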
2. Systematic sampling
Individuals are selected at regular intervals from the sampling frame. The intervals are
chosen to ensure an adequate sample size. If you need a sample size n from a
population of size x, you should select every x/nth individual for the sample. For
example, if you wanted a sample size of 100 from a population of 1000, select every
1000/100 = 10th member of the sampling frame.
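A minimal sketch of that rule, matching the example of 100 from 1000 (every 10th member); the random starting point within the first interval is a common refinement, not required by the text:

```python
import random

frame = list(range(1000))           # sampling frame of size x = 1000
n = 100                              # required sample size
interval = len(frame) // n           # select every (x/n)th individual -> 10
start = random.randrange(interval)   # random start within the first interval
sample = frame[start::interval]
print(len(sample), sample[:5])       # 100 individuals at regular intervals
```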
Systematic sampling is often more convenient than simple random sampling, and it is
easy to administer. However, it may also lead to bias, for example if there are
underlying patterns in the order of the individuals in the sampling frame, such that the
sampling technique coincides with the periodicity of the underlying pattern. As a
hypothetical example, if a group of students were being sampled to gain their opinions
on college facilities, but the Student Record Department’s central list of all students was
arranged such that the sex of students alternated between male and female, choosing
an even interval (e.g. every 20th student) would result in a sample of all males or all
females. Whilst in this example the bias is obvious and should be easily corrected, this
may not always be the case.
3. Stratified sampling
In this method, the population is first divided into subgroups (or strata) who all share a
similar characteristic. It is used when we might reasonably expect the measurement of
interest to vary between the different subgroups, and we want to ensure representation
from all the subgroups. For example, in a study of stroke outcomes, we may stratify the
population by sex, to ensure equal representation of men and women. The study sample
is then obtained by taking equal sample sizes from each stratum. In stratified sampling,
it may also be appropriate to choose non-equal sample sizes from each stratum. For
example, in a study of the health outcomes of nursing staff in a county, if there are
three hospitals each with different numbers of nursing staff (hospital A has 500 nurses,
hospital B has 1000 and hospital C has 2000), then it would be appropriate to choose
the sample numbers from each hospital proportionally (e.g. 10 from hospital A, 20 from
hospital B and 40 from hospital C). This ensures a more realistic and accurate estimation
of the health outcomes of nurses across the county, whereas taking equal sample sizes
from each stratum would over-represent nurses from hospitals A and B. The fact that the sample was
stratified should be taken into account at the analysis stage.
Stratified sampling improves the accuracy and representativeness of the results by
reducing sampling bias. However, it requires knowledge of the appropriate
characteristics of the sampling frame (the details of which are not always available), and
it can be difficult to decide which characteristic(s) to stratify by.
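The proportional allocation in the nursing example can be computed directly; a small sketch (the total sample of 70 is implied by 10 + 20 + 40):

```python
# Sample sizes in the same ratio as the strata: 500 : 1000 : 2000 -> 10 : 20 : 40.
strata = {"hospital A": 500, "hospital B": 1000, "hospital C": 2000}
total_sample = 70

total_pop = sum(strata.values())
allocation = {h: round(total_sample * size / total_pop) for h, size in strata.items()}
print(allocation)  # {'hospital A': 10, 'hospital B': 20, 'hospital C': 40}
```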
4. Clustered sampling
In a clustered sample, subgroups of the population are used as the sampling unit, rather
than individuals. The population is divided into subgroups, known as clusters, which are
randomly selected to be included in the study. Clusters are usually already defined, for
example individual GP practices or towns could be identified as clusters. In single-stage
cluster sampling, all members of the chosen clusters are then included in the study. In
two-stage cluster sampling, a selection of individuals from each cluster is then randomly
selected for inclusion. Clustering should be taken into account in the analysis. The
General Household survey, which is undertaken annually in England, is a good example
of a (one-stage) cluster sample. All members of the selected households (clusters) are
included in the survey.1
Cluster sampling can be more efficient than simple random sampling, especially where a
study takes place over a wide geographical region. For instance, it is easier to contact
lots of individuals in a few GP practices than a few individuals in many different GP
practices. Disadvantages include an increased risk of bias, if the chosen clusters are not
representative of the population, resulting in an increased sampling error.
Non-probability sampling methods

1. Convenience sampling
2. Quota sampling
This method of sampling is often used by market researchers. Interviewers are given a
quota of subjects of a specified type to attempt to recruit. For example, an interviewer
might be told to go out and select 20 adult men, 20 adult women, 10 teenage girls and
10 teenage boys so that they could interview them about their television viewing. Ideally
the quotas chosen would proportionally represent the characteristics of the underlying
population.
Whilst this has the advantage of being relatively straightforward and potentially representative, the chosen sample may not be representative of other characteristics that weren't considered (a consequence of the non-random nature of sampling).2

3. Judgement (or purposive) sampling
Also known as selective, or subjective, sampling, this technique relies on the judgement of the researcher when choosing who to ask to participate. Researchers may thus implicitly choose a "representative" sample to suit their needs, or specifically approach
individuals with certain characteristics. This approach is often used by the media when
canvassing the public for opinions and in qualitative research.
4. Snowball sampling
Bias in sampling
There are five important potential sources of bias that should be considered when
selecting a sample, irrespective of the method used. Sampling bias may be introduced
when:1
Advantages of sampling
Sampling ensures convenience, the collection of intensive and exhaustive data, and suitability where resources are limited, and it fosters better rapport. In addition, sampling has the following advantages.
5. Organization of convenience
Organizational problems involved in sampling are very few. Since the sample is of a small size, vast facilities are not required. Sampling is therefore economical in respect of resources. The study of samples involves less space and equipment.
8. Better rapport
An effective research study requires a good rapport between the
researcher and the respondents. When the population of the study
is large, the problem of rapport arises. But manageable samples
permit the researcher to establish adequate rapport with the
respondents.
Disadvantages of sampling
The reliability of the sample depends upon the appropriateness of
the sampling method used. The purpose of sampling theory is to
make sampling more efficient. But the real difficulties lie in
selection, estimation and administration of samples.
Some of the cases in the sample may not cooperate with the researcher, and some others may be inaccessible. Because of these problems, all the cases may not be taken up. The selected cases may have to be replaced by other cases. This changeability of units stands in the way of the results of the study.
5. Impossibility of sampling
Deriving a representative sample is difficult when the universe is
too small or too heterogeneous. In this case, census study is the only
alternative. Moreover, in studies requiring a very high standard of
accuracy, the sampling method may be unsuitable. There will be
chances of errors even if samples are drawn most carefully.
SAMPLING VS NON-SAMPLING ERROR

Sampling error is one which occurs due to unrepresentativeness of the sample selected for observation. Conversely, a non-sampling error is an error that arises from human error, such as an error in problem identification, or in the method or procedure used, etc.
An ideal research design seeks to control various types of error, but there are
some potential sources which may affect it. In sampling theory, total error can
be defined as the variation between the mean value of population parameter
and the observed mean value obtained in the research. The total error can be
classified into two categories, i.e. sampling error and non-sampling error.
In this article excerpt, you can find the important differences between
sampling and non-sampling error in detail.
The main reason behind sampling error is that the sampler draws various sampling units from the same population, but the units may have individual variances. Moreover, sampling errors can also arise out of defective sample design, faulty demarcation of units, a wrong choice of statistic, or substitution of a sampling unit done by the enumerator for convenience. Therefore, sampling error is considered as the deviation between the true mean value of the sample and that of the population.
The types of errors include:

Surrogate Error
Sampling Error
Measurement Error
Data Analysis Error
Population Definition Error
Respondent Error
Inability Error
Unwillingness Error
Interviewer Error
Questioning Error
Recording Error
Respondent Selection Error
Cheating Error
Non-Response Error: Error arising when some respondents who are part of the sample do not respond.
Conclusion
To end this discussion, it is fair to say that sampling error is one which is completely related to the sampling design, and it can be reduced by expanding the sample size. Conversely, non-sampling error is a basket that covers all the errors other than the sampling error and so, it is unavoidable by nature, as it is not possible to remove it completely.