Lesson № 1
We must define the problem, determine what data are needed, collect the data, and use
statistics to summarize the data and make decisions based on the data obtained. However,
populations are often so large that they are unwieldy to analyze, and time constraints make the
examination of a subset (sample) necessary. A population is the collection of all items of
interest; its size is denoted N. A sample is an observed subset of the population; its size is
denoted n. A parameter is a number that describes something about the whole population, for
example the average length of all butterflies in the world. A statistic is a number that
describes something about a sample. For example, since it is impossible to catch and measure
all the butterflies in the world, we can catch 100 butterflies, measure their lengths, and use
the average length in this sample as a statistic.
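The distinction between a parameter and a statistic can be sketched in a few lines of Python. The population below is simulated, and the mean of 5.0 cm and spread of 0.8 cm are made-up values chosen only for illustration; in practice the population is not available and the parameter is unknown.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: lengths (cm) of N = 10,000 butterflies.
population = [random.gauss(5.0, 0.8) for _ in range(10_000)]

# Parameter: describes the whole population (normally unknown in practice).
population_mean = statistics.mean(population)

# Statistic: describes only the n = 100 butterflies actually caught.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two numbers are close but generally not equal; the statistic is used as an estimate of the parameter.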
In statistics, sampling is a method of selecting a subset of the population in order to make
statistical inferences. Sampling can be classified into probability sampling and non-probability
sampling. Probability sampling refers to the selection of a sample from a population when this
selection is based on the principle of randomization. Probability sampling is more complex,
more time-consuming and usually more costly than non-probability sampling.
Simple random sampling is a procedure in which each member of the population is chosen
strictly by chance and is equally likely to be chosen. Because of the randomization, research
based on such a sample is at lower risk of research biases such as sampling bias
and selection bias.
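A minimal sketch of simple random sampling, using Python's standard `random` module. The function name `simple_random_sample` and the list of butterfly identifiers are hypothetical, introduced only for this example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)  # sampling without replacement

# Hypothetical population of N = 1000 numbered butterflies.
butterfly_ids = list(range(1, 1001))
srs = simple_random_sample(butterfly_ids, 10, seed=42)
print(srs)
```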
Systematic method. In this method, a random starting point is selected and the remaining
items are picked at a fixed interval from it. It involves the selection of every j-th item in
the population, where j is the ratio of the population size N to the desired sample size n;
that is, j = N/n.
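The systematic method can be sketched as follows. The function name `systematic_sample` is hypothetical, and the use of integer division for j when N is not an exact multiple of n is an assumption of this sketch, not something the lesson specifies.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick a random start among the first j items, then every j-th item."""
    N = len(population)
    if not 0 < n <= N:
        raise ValueError("sample size must be between 1 and N")
    j = N // n                      # sampling interval j = N/n (rounded down)
    rng = random.Random(seed)
    start = rng.randrange(j)        # random starting point: 0, 1, ..., j-1
    return [population[start + i * j] for i in range(n)]

items = list(range(100))            # hypothetical population, N = 100
picked = systematic_sample(items, 10, seed=7)   # j = 10
print(picked)
```

Whatever the starting point, consecutive selected items are exactly j positions apart.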
Stratified sampling method is appropriate when you want to ensure that specific
characteristics are proportionally represented in the sample. When using stratified
sampling, the population is divided into homogeneous, mutually exclusive groups called
strata (for example, divided by gender or race), and then independent samples are
randomly selected from each stratum. If the K strata in the population contain N1, N2, …,
NK members, then N1 + N2 + … + NK = N. Denote the numbers in the sample by n1, n2, …,
nK. Then the total number of sample members is as follows: n1 + n2 + … + nK = n.
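A sketch of stratified sampling with proportional allocation, where each stratum k contributes roughly n·Nk/N members. The function name, the three age-group strata and their sizes are hypothetical. Rounding is an assumption of this sketch: when n·Nk/N is not a whole number, the stratum sizes may not add up exactly to n.

```python
import random

def stratified_sample(strata, n, seed=None):
    """strata maps a stratum name to the list of its members."""
    rng = random.Random(seed)
    N = sum(len(members) for members in strata.values())
    result = {}
    for name, members in strata.items():
        n_k = round(n * len(members) / N)         # proportional allocation
        result[name] = rng.sample(members, n_k)   # independent random sample
    return result

# Hypothetical strata with N1 = 600, N2 = 300, N3 = 100 (N = 1000).
strata = {
    "young":  [f"y{i}" for i in range(600)],
    "middle": [f"m{i}" for i in range(300)],
    "senior": [f"s{i}" for i in range(100)],
}
by_stratum = stratified_sample(strata, 50, seed=3)
sizes = {name: len(members) for name, members in by_stratum.items()}
print(sizes)
```

Here n1 + n2 + n3 = 30 + 15 + 5 = 50 = n, and each stratum keeps its population share.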
Cluster sampling method divides the population into groups or clusters. A number of
clusters are selected randomly to represent the total population, and then all units within
selected clusters are included in the sample. No units from non-selected clusters are
included in the sample. This differs from stratified sampling, where some units are selected
from each stratum. It is appropriate only if the clusters share similar characteristics, so that
the selected clusters can represent the rest of the population.
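A sketch of cluster sampling: whole clusters are drawn at random, and every unit inside a drawn cluster enters the sample. The function name and the four school clusters are hypothetical.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """clusters maps a cluster name to the list of its units."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)   # random clusters
    # All units of the chosen clusters are included; others contribute none.
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical population split into four clusters (schools).
schools = {
    "school_A": ["a1", "a2", "a3"],
    "school_B": ["b1", "b2"],
    "school_C": ["c1", "c2", "c3", "c4"],
    "school_D": ["d1", "d2"],
}
chosen_schools, pupils = cluster_sample(schools, 2, seed=5)
print(chosen_schools, pupils)
```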
Non-random sampling is a method of selecting units from a population using a non-random
method.
Convenience sampling assumes that the population units are all alike, so any unit may be
chosen for the sample.
Quota sampling is done until a specific number of units (quotas) for various subpopulations
has been selected. For example, if there are 100 men and 100 women in the population and a
sample of 20 is to be drawn, 10 men and 10 women may be interviewed.
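The quota example can be sketched as follows. The function name `quota_sample` and the respondent labels are hypothetical. Respondents are taken in the order they are encountered, not at random, which is what makes this a non-random method.

```python
def quota_sample(respondents, quotas):
    """respondents: (unit, group) pairs in the order they are met.
    quotas: required number of units per group."""
    remaining = dict(quotas)
    selected = []
    for unit, group in respondents:
        if remaining.get(group, 0) > 0:      # quota for this group not yet full
            selected.append((unit, group))
            remaining[group] -= 1
        if not any(remaining.values()):      # every quota is filled: stop
            break
    return selected

# 100 men and 100 women; quotas of 10 each give a sample of 20.
people = [(f"M{i}", "men") for i in range(100)] + \
         [(f"W{i}", "women") for i in range(100)]
interviewed = quota_sample(people, {"men": 10, "women": 10})
print(len(interviewed))
```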
Error describes the difference between a value obtained from a data collection process and
the 'true' value for the population. Data can be affected by sampling and non-sampling error.
Sampling error occurs when the proportions of different characteristics within the sample
are not similar to the proportions of the characteristics for the whole population. If we are
taking a sample of men and women and we know that 51% of the total population are women
and 49% are men, then we should aim to have similar proportions in our sample. Increasing
the sample size will reduce the sampling error.
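The effect of sample size on sampling error can be illustrated with a small simulation. The population of 10,000 people, the sample sizes and the 500 repetitions are assumptions made for this sketch; the 51%/49% split is taken from the example above.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 51% women, 49% men.
people = ["woman"] * 5100 + ["man"] * 4900
true_share = 0.51

def mean_abs_error(n, repetitions=500):
    """Average gap between the sample share of women and the true share."""
    total = 0.0
    for _ in range(repetitions):
        drawn = random.sample(people, n)
        share = drawn.count("woman") / n
        total += abs(share - true_share)
    return total / repetitions

errors = {n: mean_abs_error(n) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(f"n = {n:4d}: average sampling error = {err:.3f}")
```

The average error shrinks as n grows, although it does not reach zero for any sample smaller than the population.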
Non-sampling error is caused by human factors, for example when:
The population actually sampled is not the relevant one.
Survey subjects may give inaccurate or dishonest answers. This could happen because
questions are phrased in a manner that is difficult to understand or in a way that appears
to make a particular answer seem more palatable or more desirable.
Survey subjects may not respond at all, or they may not respond to certain questions.
After we identify and define a problem, we collect data produced by various processes, and
then we analyze that data using one or more statistical procedures. From this analysis, we
obtain information. Information is converted into knowledge, using understanding based on
specific experience, theory, literature, and additional statistical procedures. Statistics is
divided into descriptive and inferential statistics.
Descriptive statistics focus on graphical and numerical procedures that are used to summarize
and process data. Inferential statistics focus on using the data to make predictions, forecasts,
and estimates to make better decisions.
Classification of variables
A variable is a specific characteristic (such as age or weight) of an individual or object. When
collecting the data for research, it is important to know the form of the data to interpret and
analyze it effectively. In a research study, there are mainly two types of data: categorical and
numerical.
Categorical (qualitative) data are recorded as names or labels rather than as measured
quantities. They are described with frequency distribution tables, bar charts, pie charts, and
Pareto diagrams.
Nominal data are words that describe the categories or classes of responses. Responses
to yes/no questions are categorical. Examples: a person’s name, gender, school name, or the
answer to “Do you own a car?”
Ordinal data indicate the rank ordering of items, and, similar to nominal data, the values
are words that describe responses. Some examples of ordinal data are product quality
rating (1: poor; 2: average; 3: good); satisfaction rating with your current Internet provider
(1: very dissatisfied; 2: moderately dissatisfied; 3: no opinion; 4: moderately satisfied; 5: very
satisfied); consumer preference among three different types of soft drink (1: most
preferred; 2: second choice; 3: third choice).
Numerical (quantitative) data are collected in number form and, unlike categories that are
merely coded as numbers, support meaningful arithmetic and statistical calculation. They are
described with line charts, frequency distribution tables, histograms, ogives, stem-and-leaf
displays, and scatter plots.
Numerical variables include:
A discrete numerical variable is used to represent countable items. Its possible values
can be written as a list, and this list can be finite or infinite.
A continuous numerical variable represents measurements; it may take any value within an
interval on the number line. Hence, it doesn’t involve taking counts of the items. Someone might say that
he is 6 feet (or 72 inches) tall, but his height could actually be 72.1 inches. Other examples
of continuous numerical variables include the weight of a cereal box, the time to run a
race, the distance between two cities, or the temperature. Numerical data are
further divided into two levels of measurement:
Interval data indicate rank and distance from an arbitrary zero, measured in equal unit
intervals. Suppose it is 30°C in Pune, India, and only 10°C in
Tokyo, Japan. We can conclude that the difference in temperature is 20°C, but we
cannot say that it is three times as warm in Pune as it is in Tokyo.
Ratio data indicate both rank and distance from a natural zero, with ratios of two
measures having meaning. A person who weighs 200 pounds is twice the weight of a
person who weighs 100 pounds; a person who is 40 years old is twice the age of
someone who is 20 years old.
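The temperature example can be checked numerically. The sketch below converts the same two temperatures to Fahrenheit and Kelvin using the standard conversion formulas: the "ratio" changes with the scale, which is why ratios are not meaningful for interval data, while on the Kelvin scale, which has a natural zero, the ratio is meaningful.

```python
pune_c, tokyo_c = 30.0, 10.0

# Differences are meaningful on an interval scale.
difference_c = pune_c - tokyo_c                 # 20 degrees Celsius

# The naive ratio depends on where the arbitrary zero sits.
ratio_celsius = pune_c / tokyo_c                # 3.0
pune_f = pune_c * 9 / 5 + 32                    # 86 degrees Fahrenheit
tokyo_f = tokyo_c * 9 / 5 + 32                  # 50 degrees Fahrenheit
ratio_fahrenheit = pune_f / tokyo_f             # 1.72

# Kelvin has a natural zero, so this ratio has a physical meaning.
ratio_kelvin = (pune_c + 273.15) / (tokyo_c + 273.15)   # about 1.07

print(difference_c, ratio_celsius, ratio_fahrenheit, round(ratio_kelvin, 2))
```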
Graphs that describe categorical variables. We can describe categorical variables using
frequency distribution tables and graphs such as bar charts, pie charts, and Pareto diagrams.
These graphs are commonly used by managers and marketing researchers to describe data
collected from surveys and questionnaires.
A frequency distribution is a table used to organize data. The left column (called classes or
groups) includes all possible responses on a variable being studied. The right column is a list of
the frequencies for each class. A relative frequency distribution is obtained by dividing each
frequency by the number of observations and multiplying the resulting proportion by 100%.
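A frequency distribution and the matching relative frequency distribution can be built as sketched here. The ten survey responses are made up for the example.

```python
from collections import Counter

# Hypothetical responses to a survey question.
responses = ["yes", "no", "yes", "yes", "no opinion",
             "no", "yes", "yes", "no", "yes"]

frequencies = Counter(responses)      # class -> frequency
n_obs = len(responses)                # number of observations

# Relative frequency: frequency / number of observations, times 100%.
relative = {cls: 100 * count / n_obs for cls, count in frequencies.items()}

for cls, count in frequencies.most_common():
    print(f"{cls:12s} {count:3d} {relative[cls]:6.1f}%")
```

The relative frequencies always add up to 100%.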
If our intent is to draw attention to the frequency of each category, then we will most likely
draw a bar chart. In a bar chart the height of each rectangle represents the frequency of its
category.
If we want to draw attention to the proportion of frequencies in each category, then we will
use a pie chart to depict the division of a whole into its constituent parts. The pie represents
the total, and the segments cut from its center depict shares of that total. The pie chart is
constructed so that the area of each segment is proportional to the corresponding frequency.
A Pareto diagram is a bar chart that displays the frequency of defect causes. The bar at the left
indicates the most frequent cause and the bars to the right indicate causes with decreasing
frequencies. A Pareto diagram is used to separate the “vital few” from the “trivial many.” We
arrange the bars in a Pareto diagram from left to right to emphasize the most frequent causes
of defects.
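The ordering behind a Pareto diagram can be sketched without any plotting. The defect causes and counts below are hypothetical. The cumulative percentage column is a common companion to Pareto diagrams and is added here as an assumption; the lesson itself mentions only the ordered bars.

```python
# Hypothetical defect counts by cause.
defects = {"scratches": 12, "misalignment": 45, "wrong label": 8,
           "cracks": 30, "other": 5}

# Most frequent cause first (the leftmost bar of the diagram).
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

total = sum(defects.values())
cumulative = 0
pareto_table = []
for cause, count in ordered:
    cumulative += count
    pareto_table.append((cause, count, 100 * cumulative / total))

for cause, count, cum_pct in pareto_table:
    print(f"{cause:14s} {count:3d} {cum_pct:6.1f}%")
```

In this made-up data the first two causes account for 75% of all defects, which is the “vital few” pattern the diagram is meant to reveal.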
Graphs that describe numerical variables. We can describe numerical variables using
frequency distribution tables, line charts, histograms, ogives, stem-and-leaf displays, and
scatter plots.