Intro To Statistics
Introduction
What is statistics?
The word statistics is derived from the Latin word “Status” or the Italian word “Statista”, both of
which mean “political state” or government. Early applications of statistical thinking
revolved around the needs of states to base policy on demographic and economic data.
Definition
Statistics: a branch of science that deals with the collection, presentation, analysis, and interpretation of
data.
Statistics is divided into 2 broad categories, namely descriptive and inferential statistics.
Descriptive Statistics: summary values and presentations which give some information about the data.
Eg the mean height of a 1st year student in JKUAT is 170cm. Here 170cm is a statistic which describes
the central point of the heights data.
Inferential Statistics: summary values calculated from the sample in order to make conclusions about
the target population.
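The difference between the two categories can be sketched in a few lines of Python. This is only an illustration: the heights below are invented, the function names `describe` and `approx_mean_interval` are hypothetical, and the interval uses a simple normal approximation (mean ± 1.96 × sd / √n), which is an assumption not made in these notes.

```python
import math
import statistics

# Hypothetical sample of heights in cm, invented purely for illustration.
heights_cm = [165.0, 172.0, 168.0, 175.0, 170.0, 169.0, 171.0, 170.0]


def describe(sample):
    """Descriptive statistics: summarise the data actually observed."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "sd": statistics.stdev(sample),  # sample standard deviation
    }


def approx_mean_interval(sample, z=1.96):
    """Inferential statistics: a rough interval for the population mean.

    Assumes a normal approximation; for small samples a t critical
    value would usually be preferred over z = 1.96.
    """
    summary = describe(sample)
    margin = z * summary["sd"] / math.sqrt(summary["n"])
    return summary["mean"] - margin, summary["mean"] + margin


summary = describe(heights_cm)
low, high = approx_mean_interval(heights_cm)
print("sample mean:", summary["mean"])
print("approximate interval for the population mean:", (low, high))
```

The first function only describes the eight observed values, while the second uses them to say something about the wider population from which they were drawn.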
Types of Variables
Qualitative Variables: variables whose values fall into groups or categories. They are also called
categorical variables and are further divided into 2 classes, namely nominal and ordinal variables.
a) Nominal variables: variables whose categories are just names with no natural ordering. Eg gender,
marital status, skin colour, district of birth etc
b) Ordinal variables: variables whose categories have a natural ordering. Eg education level,
performance category, degree classification etc
Quantitative Variables: these are numeric variables and are further divided into 2 classes, namely
discrete and continuous variables.
a) Discrete variables: can only assume certain values, with gaps between them. Eg the
number of calls one makes in a day, the number of vehicles passing through a certain point etc
b) Continuous variables: can assume any value in a specified range. Eg length of a telephone call,
height of a 1st year student in JKUAT etc
1. Data Collection:
1.1 Sources of Data
There are 2 sources of data, namely primary and secondary data.
Primary data: data freshly collected, ie for the first time. They are original in character, ie they are
first-hand information collected, compiled and published for some purpose. They have not undergone
any statistical treatment.
Secondary data: second-hand information mainly obtained from published sources such as statistical
abstracts, books, encyclopaedias, periodicals, media reports, census reports, CD-ROMs and other
electronic devices, and the internet. They are not original in character and have undergone some
statistical treatment at least once.
Experimental methods are so called because the investigator tests a hypothesis about a
cause-and-effect relationship, typically in a laboratory, by manipulating the independent variables
under controlled conditions.
Non-experimental methods are so called because the investigator does not control or change
any aspect of the situation under study but simply describes what naturally occurs at a certain point or
period of time.
Non-experimental methods are widely used in the social sciences. Some of the non-experimental
methods used for data collection are outlined below.
a) Field study: aims at testing hypotheses in natural life situations. It differs from a field experiment
in that the researcher does not control or manipulate the independent variables, although both
are carried out in natural conditions.
b) Census. A census is a study that obtains data from every member of a population (the totality of
individuals/items pertaining to certain characteristics). In most studies, a census is not practical
because of the cost and/or time required.
c) Sample survey. A sample survey is a study that obtains data from a subset of a population in order
to estimate population attributes/characteristics. Surveys of human populations and institutions are
common in government, health, social science and marketing research.
d) Case study. A case study is a method of intensively exploring and analyzing the life of a single social
unit, be it a person, a family, an institution, a cultural group or even an entire community. In this method
no attempt is made to exercise experimental or statistical control, and phenomena related to the unit are
studied in their natural setting. The researcher has considerable discretion in gathering information from a
variety of sources such as diaries, letters, autobiographies, office records, files or personal interviews.
e) Experiment. An experiment is a controlled study in which the researcher attempts to understand
cause-and-effect relationships. The experiment is carried out on the individuals/units
about whom information is to be drawn. The study is "controlled" in the sense that the researcher
controls how subjects are assigned to groups and which treatments each group receives.
f) Observational study. Like experiments, observational studies attempt to understand cause-and-effect
relationships. However, unlike experiments, the researcher is not able to control how subjects are
assigned to groups and/or which treatments each group receives. Under this method, information is
sought by direct observation by the investigator.
Sampling Frames
For probability sampling, we must have a list of all the individuals (units) in the population. This list,
or sampling frame, is the basis for the selection of the sample. “A [sampling] frame is a clear
and concise description of the population under study, by virtue of which the population units can
be identified unambiguously and contacted, if desired, for the purpose of the survey” (Hedayat and
Sinha, 1991).
Based on the sampling frame, the sampling design can also be classified as:
Individual Surveys: used when a list of individuals is available, when the population is small,
or for a special population.
Household Surveys: based on a census of households and used when individual-level
information is unlikely to be available. In practice this is limited to small geographical areas and is
known as an “area sampling frame”. Example: Demographic and Health Surveys (DHS).
Institutional Surveys: based on a census of institutions, eg hospital/clinic lists. Examples:
i) 1990 National Hospital Discharge Survey
ii) National Ambulatory Medical Care Survey
Non-probability sampling is any sampling method where some elements of the population have no
chance of selection (also referred to as “out of coverage” or “undercovered”), or where the probability
of selection can't be accurately determined. It yields a non-random sample, which makes it difficult
to extrapolate from the sample to the population. Examples include judgement, purposive and
convenience samples (subjective selection) and snowball sampling (studies of rare groups or diseases).
b) Systematic Sampling
“Systematic sampling, either by itself or in combination with some other method, may be the most
widely used method of sampling.” In systematic sampling we select sample units “evenly” from the list
(sampling frame): first we divide the list evenly into “blocks”, then we select one sample element
from each block.
In systematic sampling, only the first unit is selected at random, the rest being selected according to a
predetermined pattern. To select a systematic sample of n units from a population of N units, the
sampling interval k = N/n is computed first. The first unit is selected with a random start r, where
1 ≤ r ≤ k, and thereafter every kth unit is included, ie units r, r+k, r+2k, ..., r+(n-1)k.
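The selection rule above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `systematic_sample` is hypothetical, and the sketch assumes N is an exact multiple of n so that k = N/n is a whole number.

```python
import random


def systematic_sample(frame, n, rng=None):
    """Select n units from a sampling frame by systematic sampling.

    Assumption: N = len(frame) is an exact multiple of n, so the
    sampling interval k = N / n is an integer.
    """
    rng = rng or random.Random()
    N = len(frame)
    if n <= 0 or N % n != 0:
        raise ValueError("this sketch requires N to be a positive multiple of n")
    k = N // n                 # sampling interval
    r = rng.randint(1, k)      # random start with 1 <= r <= k
    # Units r, r + k, r + 2k, ..., r + (n - 1)k in 1-based numbering.
    return [frame[r - 1 + j * k] for j in range(n)]


# Usage: a frame of 100 numbered units, sample size 10, so k = 10.
frame = list(range(1, 101))
print(systematic_sample(frame, 10, random.Random(0)))
```

Only the start r is random; once it is drawn, the remaining n - 1 units are fully determined by the interval k.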
c) Stratified Sampling
In stratified sampling the population is partitioned into groups, called strata, and sampling is
performed separately within each stratum.
This sampling technique is used when:
i) population groups may have different values for the responses of interest;
ii) we want to improve our estimation for each group separately;
iii) we want to ensure an adequate sample size for each group.
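A minimal Python sketch of this idea is given below. It assumes that a simple random sample is drawn within each stratum and that the per-stratum sample sizes are supplied by the researcher; the function name `stratified_sample`, the `stratum_of` callback and the student data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, sizes, rng=None):
    """Draw a separate simple random sample within each stratum.

    units      : iterable of population units
    stratum_of : function mapping a unit to its stratum label
    sizes      : dict mapping stratum label -> sample size wanted there
    """
    rng = rng or random.Random()
    strata = defaultdict(list)
    for unit in units:                      # partition the population
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for label, size in sizes.items():       # sample each stratum on its own
        members = strata[label]
        if size > len(members):
            raise ValueError(f"stratum {label!r} has only {len(members)} units")
        sample[label] = rng.sample(members, size)
    return sample


# Usage with invented data: 30 first-year and 10 final-year students.
students = [(f"s{i}", "first_year" if i < 30 else "final_year") for i in range(40)]
result = stratified_sample(students, lambda u: u[1],
                           {"first_year": 5, "final_year": 5}, random.Random(1))
print(result)
```

Fixing the size per stratum guarantees that the smaller group (final-year students here) is adequately represented, which a single simple random sample of 10 would not ensure.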
d) Cluster Sampling
In many practical situations the population elements are grouped into a number of clusters. A list of
clusters can be constructed as the sampling frame but a complete list of elements is often unavailable,
or too expensive to construct. In this case it is necessary to use cluster sampling where a random
sample of clusters is taken and some or all elements in the selected clusters are observed. Cluster
sampling is also preferable in terms of cost, because it is much cheaper, easier and quicker to collect
data from adjoining elements than from elements chosen at random. On the other hand, cluster sampling
is less informative and less efficient per element in the sample, due to similarities between elements
within the same cluster. The loss of efficiency, however, can often be compensated for by increasing the
overall sample size. Thus, in terms of unit cost, the cluster sampling plan is efficient.
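As a rough illustration, one-stage cluster sampling (a random sample of clusters, with every element in a chosen cluster observed) might be sketched in Python as below. The function name `cluster_sample` and the village/household data are hypothetical, and the sketch assumes the cluster labels can be sorted.

```python
import random


def cluster_sample(clusters, n_clusters, rng=None):
    """One-stage cluster sampling.

    clusters   : dict mapping cluster label -> list of elements
    n_clusters : number of clusters to select at random
    Only the list of clusters is needed as the sampling frame; every
    element of each selected cluster is then observed.
    """
    rng = rng or random.Random()
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {label: list(clusters[label]) for label in chosen}


# Usage with invented data: households grouped by village.
villages = {
    "village_A": ["h1", "h2", "h3"],
    "village_B": ["h4", "h5"],
    "village_C": ["h6"],
    "village_D": ["h7", "h8"],
}
print(cluster_sample(villages, 2, random.Random(0)))
```

Note that the final number of elements observed depends on which clusters are drawn, since clusters may differ in size.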
e) Multi-Stage Samples
Here the respondents are chosen through a process of defined stages. Eg residents of
Kibera (Nairobi) may have been chosen for a survey through the following process:
from the whole country (Kenya), Nairobi is selected at random (stage 1); within
Nairobi, Langata (constituency) is selected, again at random (stage 2); Kibera is then selected
within Langata (stage 3), then polling stations within Kibera (stage 4), and then individuals from the
electoral voters’ register (stage 5). As demonstrated, five stages were gone through before the
final respondents were selected from the electoral voters’ register.
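The staged selection in this example can be sketched in Python with a nested dictionary standing in for the country. This is an illustrative sketch only: the function name `multistage_sample` is hypothetical, one unit is chosen at random at each of the first four stages, and apart from Nairobi, Langata and Kibera every name and voter in the frame is an invented placeholder.

```python
import random


def multistage_sample(node, n_final, rng=None):
    """Choose one unit at random at each stage, then sample individuals.

    node    : nested dicts (one level per stage) ending in a list of
              individuals, eg a voters' register
    n_final : number of individuals to draw at the last stage
    """
    rng = rng or random.Random()
    path = []
    while isinstance(node, dict):           # stages 1..4 in the example
        key = rng.choice(sorted(node))
        path.append(key)
        node = node[key]
    return path, rng.sample(node, n_final)  # final stage: individuals


# Hypothetical frame: county -> constituency -> area -> polling station -> voters.
frame = {
    "Nairobi": {"Langata": {"Kibera": {
        "station_1": ["v1", "v2", "v3"],
        "station_2": ["v4", "v5", "v6"],
    }}},
    "county_B": {"constituency_X": {"area_Y": {
        "station_3": ["v7", "v8", "v9"],
    }}},
}
path, respondents = multistage_sample(frame, 2, random.Random(0))
print(path, respondents)
```

The returned path records which unit was selected at each stage, so the five stages of the example correspond to four random choices of unit followed by the final draw of individuals.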
a) Convenience Sampling
This is a method of choosing subjects who are available or easy to find. It is also sometimes
referred to as haphazard, accidental, or availability sampling. The primary advantage of the
method is that it is very easy to carry out relative to other methods.
b) Quota Sampling
Quota sampling is designed to overcome the most obvious flaw of availability sampling. Rather
than taking just anyone, you set quotas to ensure that the sample you get represents certain
characteristics in proportion to their prevalence in the population. Note that for this method, you
have to know something about the characteristics of the population ahead of time. Say you want
to make sure you have a sample proportional to the population in terms of gender: you have to
know what percentages of the population are male and female, then collect the sample until its
proportions match. Marketing studies are particularly fond of this form of research design.
The primary problem with this form of sampling is that even when we know that a quota sample
is representative of the particular characteristics for which quotas have been set, we have no way
of knowing if the sample is representative in terms of any other characteristics. If we set quotas for
gender and age, we are likely to attain a sample with good representativeness on age and gender,
but one that may not be very representative in terms of income, education or other factors.
Moreover, because researchers can set quotas for only a small fraction of the characteristics relevant to
a study, quota sampling is really not much better than availability sampling. To reiterate, you must
know the characteristics of the entire population to set quotas; otherwise there is not much point in
setting up quotas. Finally, interviewers often introduce bias when allowed to self-select respondents,
which is usually the case in this form of research. In choosing males aged 18-25, interviewers are more
likely to choose those who are better dressed, or who seem more approachable or less threatening. That
may be understandable from a practical point of view, but it introduces bias into research findings.
Imagine that a researcher wants to understand more about the career goals of students at a single
university. Let’s say that the university has roughly 10,000 students. Suppose we are interested in
comparing the career goals of male and female students at this university.
In that case, we would want to ensure that the sample we select has a proportional
number of male and female students relative to the population. To create a quota sample, there are
three steps:
i) Choose the relevant grouping characteristic (here, gender) and divide the population accordingly.
ii) Calculate a quota (the number of units that should be included) for each group.
iii) Continue to invite units until the quota for each group is met.
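The three steps can be sketched in Python as follows. The 10,000 students come from the example above, but the 6,000/4,000 gender split, the sample size of 100, and the function names `proportional_quotas` and `fill_quotas` are assumptions made purely for illustration.

```python
def proportional_quotas(population_counts, sample_size):
    """Step ii: quota for each group, proportional to its population share.

    Quotas are rounded to whole units, so in general they may not add up
    exactly to sample_size.
    """
    total = sum(population_counts.values())
    return {group: round(sample_size * count / total)
            for group, count in population_counts.items()}


def fill_quotas(arrivals, group_of, quotas):
    """Step iii: accept units in order of arrival until every quota is met."""
    accepted = {group: [] for group in quotas}
    for unit in arrivals:
        group = group_of(unit)
        if group in accepted and len(accepted[group]) < quotas[group]:
            accepted[group].append(unit)
        if all(len(accepted[g]) == quotas[g] for g in quotas):
            break                            # all quotas filled
    return accepted


# Step i: group by gender (hypothetical split of the 10,000 students).
population = {"female": 6000, "male": 4000}
quotas = proportional_quotas(population, sample_size=100)
print(quotas)

# Hypothetical stream of students who happen to be available.
arrivals = [(f"u{i}", "female" if i % 2 else "male") for i in range(200)]
sample = fill_quotas(arrivals, lambda u: u[1], quotas)
print({group: len(units) for group, units in sample.items()})
```

Notice that units are accepted simply in the order in which they turn up, not at random, which is why the sample can still be unrepresentative on characteristics other than the one used to set the quotas.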
c) Purposive Sampling
Purposive sampling is a sampling method in which elements are chosen based on the purpose of the
study. Purposive sampling may involve studying the entire population of some limited group or a
subset of a population. As with other non-probability sampling methods, purposive sampling does not
produce a sample that is representative of a larger population, but it can be exactly what is needed in
some cases, such as the study of an organization, a community, or some other clearly defined and
relatively limited group.
d) Snowball Sampling
Snowball sampling is a method in which a researcher identifies one member of some population of
interest, speaks to him/her, and then asks that person to identify others in the population that the
researcher might speak to. This person is then asked to refer the researcher to yet another person, and
so on.
Snowball sampling is very good for cases where members of a special population are difficult to
locate, for example populations that are subject to social stigma and marginalisation, such as sufferers
of HIV/AIDS, as well as individuals engaged in illicit or illegal activities, including prostitution and
drug use. It is a useful choice of sampling strategy when the population you are interested
in studying is hidden or hard to reach.
However, the method creates a sample with questionable representativeness. A researcher is not sure
who is in the sample. In effect, snowball sampling often leads the researcher into a realm he/she knows
little about. It can be difficult to determine how a sample compares to a larger population. There is also
the issue of whom respondents refer you to: friends refer friends, and are less likely to refer people they
don't like, fear, etc.
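The referral process behind snowball sampling can be sketched in Python as a walk along a chain of referrals. This is only an illustration: the function name `snowball_sample` is hypothetical, the people and their referrals are invented, and the sketch assumes each person's referrals are known in advance, which is never the case in a real study.

```python
from collections import deque


def snowball_sample(referrals, seed, max_size):
    """Grow a sample from one seed member by following referrals.

    referrals : dict mapping a person -> list of people they refer
    seed      : the first member identified by the researcher
    max_size  : stop once this many people are in the sample (>= 1)
    """
    sample = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(sample) < max_size:
        person = queue.popleft()
        for contact in referrals.get(person, []):
            if contact not in seen:          # skip people already reached
                seen.add(contact)
                sample.append(contact)
                queue.append(contact)
                if len(sample) == max_size:
                    break
    return sample


# Invented referral network: F knows A, but nobody refers the researcher to F.
referrals = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "F": ["A"]}
print(snowball_sample(referrals, "A", 10))
```

In this invented network, person F is never reached because no one in the chain refers to F, which illustrates why the representativeness of a snowball sample is hard to judge.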
Disadvantages of questionnaires:
i) Requires simple questions. The questions must be straightforward enough to be
comprehended solely on the basis of printed instructions and definitions.
ii) No opportunity for probing. The answers must be accepted as final. Researchers have
no opportunity to clarify ambiguous answers.
iii) Low response rate. Respondents may not respond to all questions and/or may not
return the questionnaire.
iv) The respondent must be literate in order to read and understand the questionnaire.
v) Introduces self-selection bias.
vi) Not suitable for complex questionnaires.
Disadvantages of interviews:
i) Higher cost. Costs are involved in selecting, training, and supervising interviewers; perhaps
in paying them; and in the travel and time required to conduct interviews.
ii) Interviewer bias. The advantage of flexibility leaves room for the interviewer’s
personal influence and bias.
iii) Lack of anonymity. Often the interviewer knows all or many of the respondents. Respondents
may feel threatened or intimidated by the interviewer, especially if a respondent is sensitive
to the topic or to some of the questions.
iv) Less accessibility
v) Inconvenience
vi) Often no opportunity to consult records, families, relatives
Disadvantages of telephone surveys:
i) Biased against households without a telephone or with an unlisted number
ii) Nonresponse
iii) Difficult for sensitive issues or complex topics
iv) Limited to verbal responses