
Statistics Handouts

Uploaded by

abel

STATISTICS

Statistics is a collection of methods for collecting, displaying, analyzing, and drawing
conclusions from data.

Statistics is the science of collecting, organizing, presenting, analyzing, and
interpreting data to assist in making more effective decisions.

Statistics is a mathematical body of science that pertains to the collection, analysis,
interpretation or explanation, and presentation of data. Some consider statistics to be a
branch of mathematics; others consider it a distinct mathematical science.

Types of statistics:

• Descriptive statistics – Methods of organizing, summarizing, and presenting data in an
informative way

- summarizes data from a sample using indexes such as the mean or standard deviation

• Inferential statistics – The methods used to determine something about a population on
the basis of a sample

- draws conclusions from data that are subject to random variation (e.g., observational
errors, sampling variation)
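As a rough illustration of the descriptive side, the sketch below summarizes a small sample with its mean and standard deviation using Python's standard `statistics` module. The sample values are made up for illustration and are not part of the handout.

```python
import statistics

# Hypothetical sample: weights (kg) of eight students.
sample = [52.0, 61.5, 58.0, 70.2, 65.3, 49.8, 73.1, 60.1]

sample_mean = statistics.mean(sample)
# stdev() is the sample standard deviation (n - 1 in the denominator).
sample_sd = statistics.stdev(sample)

print(f"mean = {sample_mean:.2f}, standard deviation = {sample_sd:.2f}")
```

An inferential method would go one step further and use such sample summaries to say something about the population the sample was drawn from.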

Statistical data

• The collection of data that are relevant to the problem being studied is commonly
the most difficult, expensive, and time-consuming part of the entire research
project.

• Statistical data are usually obtained by counting or measuring items.

• Primary data are collected specifically for the analysis desired.

• Secondary data have already been compiled and are available for
statistical analysis.

Most data can be put into the following categories:

• Qualitative data are measurements that each fall into one of several categories
(hair color, ethnic group, and other attributes of the population).
Qualitative data are generally described by words or letters. They are not as
widely used as quantitative data because many numerical techniques do not
apply to qualitative data. For example, it does not make sense to find an
average hair color or blood type.

Qualitative data can be separated into two subgroups:

• dichotomic (the value is one of two options, e.g. gender - male or female)

• polynomic (the value is one of more than two options, e.g. education - primary
school, secondary school, and university)

• Quantitative data are observations that are measured on a numerical scale
(distance traveled to college, number of children in a family, etc.).

Quantitative data are always numbers and are the result of counting or
measuring attributes of a population.

Quantitative data can be separated into two subgroups:

• discrete (the result of counting: the number of students of a given ethnic group
in a class, the number of books on a shelf, ...)

• continuous (the result of measuring: distance traveled, weight of luggage,
amount of income tax paid, weight of a student, …)
Numerical scale / Levels of measurement:

1. Nominal – consists of categories in each of which the number of respective
observations is recorded. The categories are in no logical order and have no
particular relationship. The categories are said to be mutually exclusive since
an individual, object, or measurement can be included in only one of them.

A pie chart displays groups of nominal variables (i.e. categories).

Nominal is from the Latin nominalis, which means “pertaining to names”.
It’s another name for a category.

Examples:

• Gender: Male, Female, Other.
• Hair Color: Brown, Black, Blonde, Red, Other.
• Type of living accommodation: House, Apartment, Trailer, Other.
• Genotype: Bb, bb, BB, bB.
• Religious preference: Buddhist, Mormon, Muslim, Jewish, Christian, Other.
2. Ordinal – contains more information than the nominal scale. Consists of distinct
categories in which order is implied. Values in one category are larger or smaller
than values in other categories (e.g. rating: excellent, good, fair, poor).
The ordinal scale classifies according to rank.

Ordinal means in order. Includes “first,” “second” and “ninety-ninth.”

Examples:

• High school class ranking: 1st, 9th, 87th…
• Socioeconomic status: poor, middle class, rich.
• The Likert Scale: strongly disagree, disagree, neutral, agree, strongly agree.
• Level of Agreement: yes, maybe, no.
• Time of Day: dawn, morning, noon, afternoon, evening, night.
• Political Orientation: left, center, right.

3. Interval – a set of numerical measurements in which the distance between
numbers is of a known, constant size.

Interval has values of equal intervals that mean something. For example, a
thermometer might have intervals of ten degrees.

Examples:

• Celsius Temperature.
• Fahrenheit Temperature.
• IQ (intelligence scale).
• SAT scores.
• Time on a clock with hands.
4. Ratio – consists of numerical measurements where the distance between
numbers is of a known, constant size; in addition, there is a nonarbitrary zero
point.

Weight is measured on the ratio scale.

Ratio is exactly the same as the interval scale except that the zero on the scale
means “does not exist”: a value of zero indicates that none of the quantity is present.
For example, a weight of zero means there is no weight at all. On the other hand,
Celsius temperature is not a ratio scale, because its zero is arbitrary (zero on the
Celsius scale is just the freezing point of water; it doesn’t mean that there is no
temperature).
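A short sketch of why the zero point matters: on an interval scale differences are meaningful but ratios are not. The Kelvin scale is brought in here only as a contrast with a nonarbitrary zero; it is an assumption of this example and is not discussed in the handout.

```python
def celsius_to_kelvin(c):
    """Convert Celsius to Kelvin (Kelvin has a nonarbitrary zero)."""
    return c + 273.15

cold_c, warm_c = 10.0, 20.0  # hypothetical temperatures in Celsius

# Differences agree on both scales, as expected for interval data.
diff_celsius = warm_c - cold_c
diff_kelvin = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios do not agree: 20 °C is not "twice as hot" as 10 °C.
ratio_celsius = warm_c / cold_c
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(f"difference: {diff_celsius:.1f} (Celsius) vs {diff_kelvin:.1f} (Kelvin)")
print(f"ratio: {ratio_celsius:.3f} (Celsius) vs {ratio_kelvin:.3f} (Kelvin)")
```

The ratio computed in Celsius depends on where the scale happens to put its zero, which is why ratio statements are reserved for ratio-scale measurements such as weight.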

Examples:

• Age.*
• Weight.
• Height.
• Sales Figures.
• Ruler measurements.
• Income earned in a week.
• Years of education.
• Number of children.

Variable

A variable is a logical set of attributes. Variables can “vary” – for example, be high or
low. How high, or how low, is determined by the value of the attribute (and in fact, an
attribute could be just the word “low” or “high”).

A variable is a characteristic of a statistical unit being observed that may assume more
than one of a set of values to which a numerical measure or a category from a
classification can be assigned.

Dependent and Independent Variable

An independent variable, sometimes called an experimental or predictor variable, is a
variable that is being manipulated in an experiment in order to observe the effect on a
dependent variable, sometimes called an outcome variable.

Imagine that a tutor asks 100 students to complete a maths test. The tutor wants to
know why some students perform better than others. Whilst the tutor does not know the
answer to this, she thinks that it might be because of two reasons: (1) some students
spend more time revising for their test; and (2) some students are naturally more
intelligent than others. As such, the tutor decides to investigate the effect of revision
time and intelligence on the test performance of the 100 students. The dependent and
independent variables for the study are:

Dependent Variable: Test Mark (measured from 0 to 100)

Independent Variables: Revision time (measured in hours); Intelligence (measured
using IQ score)

The dependent variable is simply that: a variable that is dependent on one or more
independent variables. For example, in our case the test mark that a student achieves is
dependent on revision time and intelligence. Whilst revision time and intelligence (the
independent variables) may (or may not) cause a change in the test mark (the dependent
variable), the reverse is implausible; in other words, whilst the number of hours a student
spends revising and a student's IQ score may (or may not) change the test mark that
a student achieves, a change in a student's test mark has no bearing on whether a
student revises more or is more intelligent (this simply doesn't make sense).

Therefore, the aim of the tutor's investigation is to examine whether these independent
variables - revision time and IQ - result in a change in the dependent variable, the
students' test scores. However, it is also worth noting that whilst this is the main aim of
the experiment, the tutor may also be interested to know if the independent variables -
revision time and IQ - are also connected in some way.

Population

Population – The entire set of individuals or objects of interest, or the measurements
obtained from all individuals or objects of interest. For example, the population of
German people shares a common geographic origin, language, literature, and genetic
heritage, among other traits, that distinguish them from people of different nationalities.
As another example, the Milky Way galaxy comprises a star population.

Typically, the population is very large, making a census or a complete enumeration of
all the values in the population impractical or impossible.

Sample – A portion, or part, of the population of interest. It is a set of data collected
and/or selected from a statistical population by a defined procedure. The sample usually
represents a subset of manageable size. Samples are collected and statistics are
calculated from the samples so that one can make inferences or extrapolations from
the sample to the population.

Sampling is concerned with the selection of a subset of individuals from within a
statistical population to estimate characteristics of the whole population.

A sample should have the same characteristics as the population it is representing.

Sampling can be:

• with replacement: a member of the population may be chosen more than once
(picking a candy from the bowl and putting it back before the next pick)

• without replacement: a member of the population may be chosen only once
(lottery ticket)
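The two modes can be sketched with Python's standard `random` module: `choices` draws with replacement and `sample` draws without replacement. The candy names and the seed are illustrative assumptions.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = ["red", "green", "blue", "yellow", "orange"]  # hypothetical candies

# With replacement: the same member may be drawn more than once,
# so the sample may even be larger than the population.
with_replacement = rng.choices(population, k=8)

# Without replacement: each member can be drawn at most once,
# so the sample size cannot exceed the population size.
without_replacement = rng.sample(population, k=3)

print(with_replacement)
print(without_replacement)
```

Because eight draws are made from only five candies, the first sample necessarily contains repeats, while the second never does.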

Sampling methods

Sampling methods can be:

• random (each member of the population has an equal chance of being selected)

• nonrandom

The actual process of sampling causes sampling errors. For example, the sample may
not be large enough or representative of the population. Factors not related to the
sampling process cause non-sampling errors. A defective counting device can cause a
non-sampling error.

Random sampling methods

• simple random sample (each sample of the same size has an equal chance of
being selected)

• stratified sample (divide the population into groups called strata and then take a
sample from each stratum)

• cluster sample (divide the population into groups called clusters and then randomly
select some of the clusters. All the members from these clusters are in the cluster
sample.)

• systematic sample (randomly select a starting point and take every n-th piece of
data from a listing of the population)
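The four random sampling methods listed above can be sketched as follows. This is a minimal illustration, assuming a hypothetical population of 20 members split into four equal groups; the group labels, sample sizes, and seed are all made up.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 20 members, five in each of the groups A-D.
population = [{"id": i, "group": "ABCD"[i // 5]} for i in range(20)]

# Simple random sample: every subset of size 5 is equally likely.
simple = rng.sample(population, k=5)

# Collect the members of each group (used as strata and as clusters below).
groups = {}
for member in population:
    groups.setdefault(member["group"], []).append(member)

# Stratified sample: draw 2 members from every group.
stratified = [m for members in groups.values() for m in rng.sample(members, k=2)]

# Cluster sample: pick 2 whole groups at random and keep all of their members.
chosen = rng.sample(sorted(groups), k=2)
cluster = [m for label in chosen for m in groups[label]]

# Systematic sample: random starting point, then every 4th member of the listing.
step = 4
start = rng.randrange(step)
systematic = population[start::step]

print(len(simple), len(stratified), len(cluster), len(systematic))
```

Note the difference between the stratified and cluster samples: the first contains a few members from every group, the second contains every member from a few groups.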

MEASURES OF CENTRAL TENDENCY

A central tendency (or measure of central tendency) is a central or typical value for a
probability distribution. It may also be called a center or location of the distribution.
Colloquially, measures of central tendency are often called averages. The term central
tendency dates from the late 1920s.

The most common measures of central tendency are the arithmetic mean, the median
and the mode. A central tendency can be calculated for either a finite set of values or for
a theoretical distribution, such as the normal distribution. Occasionally authors use
central tendency to denote "the tendency of quantitative data to cluster around some
central value."

The following may be applied to one-dimensional data. Depending on the circumstances, it may
be appropriate to transform the data before calculating a central tendency. Examples are squaring
the values or taking logarithms. Whether a transformation is appropriate, and what it should be,
depends heavily on the data being analyzed.

Arithmetic mean or simply, mean

the sum of all measurements divided by the number of observations in the data
set.

Median

the middle value that separates the higher half from the lower half of the data set.
The median and the mode are the only measures of central tendency that can be
used for ordinal data, in which values are ranked relative to each other but are
not measured absolutely.

Mode

the most frequent value in the data set. This is the only central tendency
measure that can be used with nominal data, which have purely qualitative
category assignments.

Geometric mean

the nth root of the product of the data values, where there are n of these. This
measure is valid only for data that are measured absolutely on a strictly positive
scale.

Harmonic mean

the reciprocal of the arithmetic mean of the reciprocals of the data values. This
measure too is valid only for data that are measured absolutely on a strictly
positive scale.

Weighted arithmetic mean

an arithmetic mean in which some data elements are given more weight than others.

Truncated mean or trimmed mean

the arithmetic mean of data values after a certain number or proportion of the
highest and lowest data values have been discarded.

Interquartile mean

a truncated mean based on data within the interquartile range.

Midrange

the arithmetic mean of the maximum and minimum values of a data set.

Midhinge

the arithmetic mean of the first and third quartiles.

Trimean

the weighted arithmetic mean of the median and the first and third quartiles.

Winsorized mean

an arithmetic mean in which extreme values are replaced by values closer to the
median.
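Several of the measures defined above can be computed with Python's standard `statistics` module. The sketch below uses a made-up data set with one large outlier; the helper names `trimmed_mean` and `weighted_mean` are hypothetical and written only for this illustration.

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 40]  # hypothetical positive measurements, one outlier

mean = statistics.mean(data)                  # sum divided by count
median = statistics.median(data)              # middle value of the sorted data
mode = statistics.mode(data)                  # most frequent value
geometric = statistics.geometric_mean(data)   # needs strictly positive values
harmonic = statistics.harmonic_mean(data)     # needs strictly positive values
midrange = (min(data) + max(data)) / 2        # mean of minimum and maximum

def trimmed_mean(values, k):
    """Arithmetic mean after discarding the k lowest and k highest values."""
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])

def weighted_mean(values, weights):
    """Arithmetic mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(f"mean={mean:.3f} median={median} mode={mode} midrange={midrange}")
print(f"geometric={geometric:.3f} harmonic={harmonic:.3f}")
print(f"trimmed (k=1)={trimmed_mean(data, 1)}")
print(f"weighted={weighted_mean([80, 90], [1, 3])}")
```

With the outlier 40 in the data, the mean and midrange are pulled upward, while the median and the trimmed mean stay close to the bulk of the values.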

Any of the above may be applied to each dimension of multi-dimensional data, but the
results may not be invariant to rotations of the multi-dimensional space. In addition,
there are the

Geometric median

which minimizes the sum of distances to the data points. This is the same as the
median when applied to one-dimensional data, but it is not the same as taking
the median of each dimension independently. It is not invariant to different
rescaling of the different dimensions.

Quadratic mean (often known as the root mean square)

useful in engineering, but not often used in statistics. This is because it is not a
good indicator of the center of the distribution when the distribution includes
negative values.

Simplicial depth

the probability that a randomly chosen simplex with vertices from the given
distribution will contain the given center

Tukey median

a point with the property that every halfspace containing it also contains many
sample points
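Returning to the quadratic mean described above, a small sketch shows why it is a poor indicator of center when negative values are present. The data are hypothetical and symmetric around zero.

```python
import math
import statistics

data = [-3, -1, 1, 3]  # hypothetical values symmetric around zero

arithmetic_mean = statistics.mean(data)
# Quadratic mean (root mean square): square, average, then take the root.
quadratic_mean = math.sqrt(sum(x * x for x in data) / len(data))

print(f"arithmetic mean = {arithmetic_mean}")
print(f"quadratic mean  = {quadratic_mean:.3f}")
```

The arithmetic mean sits at the center of the data, while the quadratic mean reports a typical magnitude because squaring discards the signs.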

Frequency Distribution

A frequency distribution is an orderly arrangement of data classified according to the
magnitude of the observations. When the data are grouped into classes of appropriate
size, with the number of observations in each class indicated, we get a frequency
distribution. By forming a frequency distribution, we can summarize the data effectively.
It is a method of presenting the data in a summarized form. A frequency distribution is
also known as a frequency table.

Uses of Frequency Distribution

1. A frequency distribution helps us to analyze the data.
2. A frequency distribution helps us to estimate the frequencies of the population on the
basis of the sample.
3. A frequency distribution facilitates the computation of various statistical
measures.

Frequency Distribution Table

A frequency distribution table (also known as a frequency table) consists of several
components.

Classes: A large number of observations varying in a wide range are usually classified
in several groups according to the size of their values. Each of these groups is defined
by an interval called class interval. The class interval between 10 and 20 is defined as
10-20.

Class limits: The smallest and largest possible values in each class of a frequency
distribution table are known as class limits. For the class 10-20, the class limits are 10
and 20. 10 is called the lower class limit and 20 is called the upper class limit.

Class mark: The class mark is the midmost value of the class interval. It is also known
as the mid value.

Mid value of each class = (lower limit + upper limit)/2.

If the class is 0-10, the lower limit is 0 and the upper limit is 10. So the mid value is
(0+10)/2 = 10/2 = 5.

Magnitude of a class interval: The difference between the upper and lower limit of a
class is called the magnitude of a class interval.

Class frequency: The number of observations falling within a class interval is called
the class frequency of that class interval.
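The mid value and magnitude of a class follow directly from its class limits. A minimal sketch, using the classes 0-10 and 10-20 mentioned above; the function name `describe_class` is hypothetical.

```python
def describe_class(lower, upper):
    """Return the mid value and magnitude (width) of a class interval."""
    mid_value = (lower + upper) / 2   # (lower limit + upper limit) / 2
    magnitude = upper - lower         # upper limit minus lower limit
    return mid_value, magnitude

for lower, upper in [(0, 10), (10, 20)]:
    mid, width = describe_class(lower, upper)
    print(f"class {lower}-{upper}: mid value = {mid}, magnitude = {width}")
```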
Construct a Frequency Distribution

A frequency distribution table is one way to organize data so that it makes more sense.
Data organized in this way are called a frequency distribution, and the tabular form is
called a frequency distribution table. For example, a frequency distribution table of test
marks lists all the marks and also shows how many times (frequency) each of them occurred.

The number which tells us how many times a particular data value appears is called the
frequency. For example, if a mark of 2 has been scored by five students, the mark 2
occurs five times, so the frequency of the mark 2 is five. Similarly, the frequency of the
mark 5 is three if three students scored five marks.
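Counting how often each value occurs can be sketched with `collections.Counter`. The list of marks below is hypothetical; it was chosen only so that the mark 2 occurs five times and the mark 5 occurs three times, matching the frequencies quoted above.

```python
from collections import Counter

# Hypothetical marks of 12 students (not taken from the handout).
marks = [2, 5, 2, 3, 2, 4, 5, 2, 3, 2, 5, 4]

# Counter maps each distinct mark to the number of times it occurs.
frequency = Counter(marks)

print("Mark  Frequency")
for mark in sorted(frequency):
    print(f"{mark:>4}  {frequency[mark]:>9}")
```

The frequencies always add up to the number of observations, which is a quick check that no value was missed.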

Relative Frequency Distribution

A relative frequency distribution is a distribution in which relative frequencies are
recorded against each class interval. The relative frequency of a class is obtained by
dividing the frequency of that class by the total frequency. It is the proportion of the
total frequency that is in any given class interval in the frequency distribution.

Relative Frequency Distribution Table

If the frequencies in a frequency distribution table are changed into relative frequencies,
the table is called a relative frequency distribution table. For a data set consisting of
n values, if f is the frequency of a particular value, then the ratio f/n is called its
relative frequency.


Solved Example
Question: Find the relative frequency from the data given below:

Class interval    Frequency
20-25             10
25-30             12
30-35             8
35-40             20
40-45             11
45-50             4
50-55             5

Solution:

Relative frequency distribution table for the given data.

Here n = 70

Class interval    Frequency (f)    Relative frequency (f/n)
20-25             10               10 / 70 = 0.143
25-30             12               12 / 70 = 0.171
30-35             8                8 / 70 = 0.114
35-40             20               20 / 70 = 0.286
40-45             11               11 / 70 = 0.157
45-50             4                4 / 70 = 0.057
50-55             5                5 / 70 = 0.071
Total             n = 70
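The solved example can be reproduced in a few lines. This sketch uses the class intervals and frequencies from the question; the variable names are illustrative.

```python
classes = ["20-25", "25-30", "30-35", "35-40", "40-45", "45-50", "50-55"]
frequencies = [10, 12, 8, 20, 11, 4, 5]

n = sum(frequencies)                       # total frequency
relative = [f / n for f in frequencies]    # relative frequency of each class

print(f"n = {n}")
for label, f, r in zip(classes, frequencies, relative):
    print(f"{label}  {f:>2}  {r:.3f}")
```

The unrounded relative frequencies sum to 1; the rounded values in the table may differ from 1 slightly in the last digit.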

Cumulative Frequency Distribution

One important type of frequency distribution is the cumulative frequency distribution.
In a cumulative frequency distribution, the frequencies are shown in a cumulative
manner. The cumulative frequency for each class interval is the frequency for that class
interval added to the preceding cumulative total. Cumulative frequency can also be defined
as the sum of all frequencies up to and including the current class.
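As a sketch, the running total can be computed with `itertools.accumulate`, here applied to the frequencies from the solved example above.

```python
from itertools import accumulate

frequencies = [10, 12, 8, 20, 11, 4, 5]  # frequencies from the solved example

# Each entry is the class frequency added to the preceding cumulative total.
cumulative = list(accumulate(frequencies))

print(cumulative)
```

The last cumulative frequency always equals the total frequency n.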

Cumulative Relative Frequency Distribution

A cumulative relative frequency distribution is one type of frequency distribution. The
relative cumulative frequency is the cumulative frequency divided by the total frequency.
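A minimal sketch of this division, again assuming the frequencies from the solved example:

```python
from itertools import accumulate

frequencies = [10, 12, 8, 20, 11, 4, 5]  # frequencies from the solved example
n = sum(frequencies)

# Cumulative frequency of each class divided by the total frequency.
cumulative_relative = [c / n for c in accumulate(frequencies)]

for value in cumulative_relative:
    print(f"{value:.3f}")
```

The values never decrease and the final one is 1, since by the last class every observation has been counted.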
