Statistics Handouts
Types of statistics:
Statistical data
The collection of data that are relevant to the problem being studied is commonly
the most difficult, expensive, and time-consuming part of the entire research
project.
Secondary data have already been compiled and are available for
statistical analysis
Statistical data are usually obtained by counting or measuring items. Most data can be
put into the following categories:
Qualitative data are measurements that each fall into one of several categories.
(hair color, ethnic groups and other attributes of the population)
Qualitative data are generally described by words or letters. They are not as
widely used as quantitative data because many numerical techniques do not
apply to the qualitative data. For example, it does not make sense to find an
average hair color or blood type.
dichotomic (if it takes the form of a word with two options, for example
gender: male or female)
polynomic (if it takes the form of a word with more than two options, for example
education: primary school, secondary school and university)
Quantitative data are always numbers and are the result of counting or
measuring attributes of a population.
Continuous
Amount of income tax paid,
weight of a student
Numerical scale / Levels of measurement:
Examples:
Interval has values of equal intervals that mean something. For example, a
thermometer might have intervals of ten degrees.
Examples:
Celsius Temperature.
Fahrenheit Temperature.
IQ (intelligence scale).
SAT scores.
Time on a clock with hands.
4. Ratio – consists of numerical measurements where the distance between
numbers is of a known, constant size; in addition, there is a nonarbitrary zero
point.
Examples:
Age.*
Weight.
Height.
Sales Figures.
Ruler measurements.
Income earned in a week.
Years of education.
Number of children.
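The difference between the interval and ratio levels can be checked with a little arithmetic. The sketch below is only an illustration: the two temperatures are made-up values, and the Kelvin scale (not mentioned in this handout) is brought in as an example of a temperature scale with a nonarbitrary zero.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are not meaningful.
c_low, c_high = 10.0, 20.0  # illustrative values, not data from the handout

# Ratio scale: Kelvin starts at absolute zero, a nonarbitrary zero point.
k_low, k_high = c_low + 273.15, c_high + 273.15

celsius_ratio = c_high / c_low   # 2.0, yet 20 degrees C is not "twice as hot"
kelvin_ratio = k_high / k_low    # roughly 1.035, the physically meaningful ratio

# Differences are meaningful on both scales and agree with each other.
celsius_diff = c_high - c_low
kelvin_diff = k_high - k_low

print(celsius_ratio, round(kelvin_ratio, 3), celsius_diff, round(kelvin_diff, 2))
```

On an interval scale only the differences carry meaning; on a ratio scale both differences and ratios do.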
Variable
A variable is a logical set of attributes. Variables can “vary” – for example, be high or low.
How high, or how low, is determined by the value of the attribute (and in fact, an
attribute could be just the word “low” or “high”).
A variable is a characteristic of a statistical unit being observed that may assume more
than one of a set of values to which a numerical measure or a category from a
classification can be assigned.
Imagine that a tutor asks 100 students to complete a maths test. The tutor wants to
know why some students perform better than others. Whilst the tutor does not know the
answer to this, she thinks that it might be because of two reasons: (1) some students
spend more time revising for their test; and (2) some students are naturally more
intelligent than others. As such, the tutor decides to investigate the effect of revision
time and intelligence on the test performance of the 100 students. The dependent and
independent variables for the study are:
Dependent variable: test score
Independent variables: revision time and intelligence (IQ)
Therefore, the aim of the tutor's investigation is to examine whether these independent
variables - revision time and IQ - result in a change in the dependent variable, the
students' test scores. However, it is also worth noting that whilst this is the main aim of
the experiment, the tutor may also be interested to know if the independent variables -
revision time and IQ - are also connected in some way.
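One common way to look at how strongly two variables are connected is the Pearson correlation coefficient. The handout does not say which method the tutor would use, so the following is only a sketch under that assumption, and the five students' values are hypothetical numbers invented for the illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for five students (not data from the handout).
revision_hours = [1, 2, 3, 4, 5]        # independent variable
iq = [95, 110, 100, 120, 105]           # independent variable
test_score = [52, 60, 61, 75, 70]       # dependent variable

print("revision time vs score:", pearson(revision_hours, test_score))
print("IQ vs score:           ", pearson(iq, test_score))
print("revision time vs IQ:   ", pearson(revision_hours, iq))
```

A value near +1 or -1 suggests a strong linear relationship and a value near 0 a weak one. A correlation on its own does not show that the independent variable causes the change.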
Population
Sample – A portion, or part, of the population of interest. It is a set of data collected and/or
selected from a statistical population by a defined procedure. The sample usually
represents a subset of manageable size. Samples are collected and statistics are
calculated from the samples so that one can make inferences or extrapolations from
the sample to the population.
with replacement: a member of the population may be chosen more than once
(picking the candy from the bowl)
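Sampling with replacement can be contrasted with sampling without replacement in a few lines. This is a minimal sketch using Python's standard random module; the bowl of candy colours is a made-up population, and the fixed seed is there only to make the run repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
population = ["red", "green", "blue", "yellow"]  # hypothetical candy bowl

# With replacement: a member goes back into the bowl and may be drawn again.
with_replacement = rng.choices(population, k=6)

# Without replacement: each member can be drawn at most once.
without_replacement = rng.sample(population, k=3)

print(with_replacement)
print(without_replacement)
```

Drawing six times from four members with replacement must repeat at least one member, while the sample drawn without replacement never repeats.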
Sampling methods
random (each member of the population has an equal chance of being selected)
nonrandom
The actual process of sampling causes sampling errors. For example, the sample may
not be large enough or representative of the population. Factors not related to the
sampling process cause non-sampling errors. A defective counting device can cause a
non-sampling error.
stratified sample (divide the population into groups called strata and then take a
sample from each stratum)
cluster sample (divide the population into groups called clusters and then randomly select some
of the clusters. All the members from these clusters are in the cluster sample.)
systematic sample (randomly select a starting point and take every n-th piece of
data from a listing of the population)
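The three sampling methods listed above can be sketched as follows. The population of twelve students, the group labels A, B and C, and the sample sizes are all hypothetical choices made for the illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: 12 students, each tagged with a group label.
population = [(f"s{i:02d}", "ABC"[i % 3]) for i in range(12)]

groups = {}
for name, label in population:
    groups.setdefault(label, []).append(name)

# Stratified: take a random sample from every group.
stratified = [name for members in groups.values() for name in rng.sample(members, 2)]

# Cluster: randomly pick whole groups and keep all of their members.
chosen = rng.sample(sorted(groups), 1)
cluster = [name for label in chosen for name in groups[label]]

# Systematic: random starting point, then every n-th item of the listing.
step = 4
start = rng.randrange(step)
systematic = [name for name, _ in population[start::step]]

print(stratified, cluster, systematic, sep="\n")
```

The stratified sample contains members of every group, whereas the cluster sample contains every member of only the selected group.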
A central tendency (or measure of central tendency) is a central or typical value for a
probability distribution. It may also be called a center or location of the distribution.
Colloquially, measures of central tendency are often called averages. The term central
tendency dates from the late 1920s.
The most common measures of central tendency are the arithmetic mean, the median
and the mode. A central tendency can be calculated for either a finite set of values or for
a theoretical distribution, such as the normal distribution. Occasionally authors use
central tendency to denote "the tendency of quantitative data to cluster around some
central value."
The following may be applied to one-dimensional data. Depending on the circumstances, it may
be appropriate to transform the data before calculating a central tendency. Examples are squaring
the values or taking logarithms. Whether a transformation is appropriate and what it should be,
depend heavily on the data being analyzed.
Arithmetic mean
the sum of all measurements divided by the number of observations in the data
set.
Median
the middle value that separates the higher half from the lower half of the data set.
The median and the mode are the only measures of central tendency that can be
used for ordinal data, in which values are ranked relative to each other but are
not measured absolutely.
Mode
the most frequent value in the data set. This is the only central tendency
measure that can be used with nominal data, which have purely qualitative
category assignments.
Geometric mean
the nth root of the product of the data values, where there are n of these. This
measure is valid only for data that are measured absolutely on a strictly positive
scale.
Harmonic mean
the reciprocal of the arithmetic mean of the reciprocals of the data values. This
measure too is valid only for data that are measured absolutely on a strictly
positive scale.
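These measures are available in Python's standard statistics module, so they can be compared on one small data set. The values below are hypothetical and strictly positive so that the geometric and harmonic means are defined. The last line shows the logarithm transformation mentioned earlier: the geometric mean equals the exponential of the arithmetic mean of the logarithms.

```python
import math
import statistics

data = [2, 4, 4, 8, 16]  # hypothetical, strictly positive values

mean = statistics.mean(data)                 # sum of values / number of values
median = statistics.median(data)             # middle value of the sorted data
mode = statistics.mode(data)                 # most frequent value
geometric = statistics.geometric_mean(data)  # n-th root of the product
harmonic = statistics.harmonic_mean(data)    # reciprocal of the mean of reciprocals

# Geometric mean computed through a log transformation of the data.
geometric_via_logs = math.exp(statistics.mean([math.log(x) for x in data]))

print(mean, median, mode, geometric, harmonic, geometric_via_logs)
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean.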
Truncated mean
the arithmetic mean of data values after a certain number or proportion of the
highest and lowest data values have been discarded.
Interquartile mean
a truncated mean based on data within the interquartile range.
Midrange
the arithmetic mean of the maximum and minimum values of a data set.
Midhinge
the arithmetic mean of the first and third quartiles.
Trimean
the weighted arithmetic mean of the median and two quartiles.
Winsorized mean
an arithmetic mean in which extreme values are replaced by values closer to the
median.
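Several of these measures reduce the influence of extreme values, which a small sketch can make visible. The helper functions and the data set are hypothetical illustrations. Quartile conventions differ between textbooks, so the trimean here depends on the default method of statistics.quantiles, and the weights 1, 2, 1 for the quartiles and the median are the usual definition rather than something stated in this handout.

```python
import statistics

def truncated_mean(values, k):
    """Mean after discarding the k lowest and the k highest values."""
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])

def winsorized_mean(values, k):
    """Mean after replacing the k lowest/highest values by the nearest kept value."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return statistics.mean([kept[0]] * k + kept + [kept[-1]] * k)

def midrange(values):
    """Mean of the minimum and maximum values."""
    return (min(values) + max(values)) / 2

def trimean(values):
    """Weighted mean of the quartiles and the median: (Q1 + 2*Q2 + Q3) / 4."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (q1 + 2 * q2 + q3) / 4

data = [1, 2, 3, 4, 5, 6, 100]  # hypothetical data with one extreme value

print("mean           ", statistics.mean(data))
print("truncated mean ", truncated_mean(data, 1))
print("winsorized mean", winsorized_mean(data, 1))
print("trimean        ", trimean(data))
print("midrange       ", midrange(data))
```

The single value 100 pulls the plain mean and the midrange far from the bulk of the data, while the truncated mean, Winsorized mean and trimean stay close to the median.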
Any of the above may be applied to each dimension of multi-dimensional data, but the
results may not be invariant to rotations of the multi-dimensional space. In addition,
there are the
Geometric median
which minimizes the sum of distances to the data points. This is the same as the
median when applied to one-dimensional data, but it is not the same as taking
the median of each dimension independently. It is not invariant to different
rescaling of the different dimensions.
Quadratic mean (root mean square)
useful in engineering, but not often used in statistics. This is because it is not a
good indicator of the center of the distribution when the distribution includes
negative values.
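Assuming the measure described here is the quadratic mean (root mean square), the problem with negative values is easy to demonstrate. The data set below is hypothetical and symmetric around zero.

```python
import math

def quadratic_mean(values):
    """Root mean square: square root of the mean of the squared values."""
    return math.sqrt(sum(x * x for x in values) / len(values))

data = [-3, -1, 1, 3]  # hypothetical values centred on zero

arithmetic = sum(data) / len(data)  # the centre of the data
rms = quadratic_mean(data)          # squaring discards the signs

print(arithmetic, rms)
```

Because every value is squared, the negative values contribute as if they were positive, so the result lies well away from the centre of these data.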
Simplicial depth
the probability that a randomly chosen simplex with vertices from the given
distribution will contain the given center
Tukey median
a point with the property that every halfspace containing it also contains many
sample points
Frequency Distribution
Classes: A large number of observations varying in a wide range are usually classified
in several groups according to the size of their values. Each of these groups is defined
by an interval called class interval. The class interval between 10 and 20 is defined as
10-20.
Class limits: The smallest and largest possible values in each class of a frequency
distribution table are known as class limits. For the class 10-20, the class limits are 10
and 20. 10 is called the lower class limit and 20 is called the upper class limit.
Class mark: The class mark is the midmost value of the class interval. It is also known as the
mid value.
If the class is 0-10, the lower limit is 0 and the upper limit is 10, so the mid value is
(0+10)/2 = 10/2 = 5.
Magnitude of a class interval: The difference between the upper and lower limit of a
class is called the magnitude of a class interval.
Class frequency: The number of observations falling within a class interval is called
the class frequency of that class interval.
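The terms defined above can be computed directly for a set of classes. This is a minimal sketch: the observations and class boundaries are hypothetical, and the rule that a class contains values from its lower limit up to but not including its upper limit is an assumption, since the handout does not say where a boundary value such as 10 belongs.

```python
# Hypothetical observations and class intervals (not data from the handout).
observations = [3, 7, 12, 15, 18, 21, 25, 29]
classes = [(0, 10), (10, 20), (20, 30)]

summary = []
for lower, upper in classes:
    mid_value = (lower + upper) / 2   # midmost value of the class interval
    magnitude = upper - lower         # upper limit minus lower limit
    # Assumed convention: lower <= x < upper.
    frequency = sum(lower <= x < upper for x in observations)
    summary.append((lower, upper, mid_value, magnitude, frequency))

for lower, upper, mid_value, magnitude, frequency in summary:
    print(f"{lower}-{upper}: mid value {mid_value}, magnitude {magnitude}, frequency {frequency}")
```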
Construct a Frequency Distribution
A frequency distribution table is one way to organize data so that they make more sense.
Data organized in this way are called a frequency distribution, and the tabular form is called
a frequency distribution table. Let us see, with the help of an example, how to construct
a frequency distribution table.
The frequency distribution table lists all the marks and also shows how many times
(frequency) they occurred.
The number that tells us how many times a particular value appears is called the
frequency. For example, a mark of 2 was scored by five students, which means that
the mark 2 occurs five times. Therefore, the frequency of the mark 2 is five. Similarly, the
frequency of the mark 5 is three because three students scored five marks.
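Counting frequencies is a one-step operation with collections.Counter from the Python standard library. The list of marks below is hypothetical; it was chosen only so that the mark 2 occurs five times and the mark 5 occurs three times, matching the description above.

```python
from collections import Counter

# Hypothetical marks: 2 occurs five times and 5 occurs three times.
marks = [2, 5, 2, 3, 2, 4, 5, 2, 3, 2, 5, 4]

frequency = Counter(marks)

# Print the frequency distribution table, one row per distinct mark.
print("Mark  Frequency")
for mark in sorted(frequency):
    print(f"{mark:>4}  {frequency[mark]:>9}")
```

The frequencies always add up to the number of observations, which is a quick check that no value was missed.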
If the frequencies in a frequency distribution table are converted into relative frequencies,
the table is called a relative frequency distribution table. For a
data set consisting of n values, if f is the frequency of a particular value, then the ratio f/n
is the relative frequency of that value.
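Example: find the relative frequency of each class in the following distribution.
Class interval    Frequency
20-25             10
25-30             12
30-35             8
35-40             20
40-45             11
45-50             4
50-55             5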
Solution:
Here n = 70
Class interval    Frequency    Relative frequency
20-25             10           10 / 70 = 0.143
25-30             12           12 / 70 = 0.171
30-35             8            8 / 70 = 0.114
35-40             20           20 / 70 = 0.286
40-45             11           11 / 70 = 0.157
45-50             4            4 / 70 = 0.057
50-55             5            5 / 70 = 0.071
Total             n = 70
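The solution can be reproduced in a few lines, which also guards against arithmetic slips. The class labels and frequencies are taken from the table above; the variable names are arbitrary.

```python
classes = ["20-25", "25-30", "30-35", "35-40", "40-45", "45-50", "50-55"]
frequencies = [10, 12, 8, 20, 11, 4, 5]

n = sum(frequencies)                     # total number of observations
relative = [f / n for f in frequencies]  # relative frequency f/n of each class

print(f"n = {n}")
for label, f, r in zip(classes, frequencies, relative):
    print(f"{label}  {f:>2}  {r:.3f}")
```

The unrounded relative frequencies sum to 1. The rounded values in the table sum to 0.999 because of rounding.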