Chapter One: Definition of Statistics
1. Introduction
Statistical thinking has nowadays become essential in many fields of study. Its
usefulness has spread to such diverse fields as agriculture, business, accounting, marketing,
economics, management, medicine, political science, psychology, sociology, engineering,
journalism, metrology, tourism, etc. For this reason, statistics is now included in the curriculum of
many professional and academic study programs.
This chapter introduces the subject matter of statistics, the art of learning from data. It describes
the two branches of statistics, descriptive and inferential. The idea of learning about a population
by sampling and studying certain of its members is described.
Definition of Statistics: The word “statistics” has different meanings to different people. When
most people hear the word, they think of tables of figures giving births, deaths, marriages,
divorces, accidents, etc. Some people think of statistics as information about an activity (like
production, population, national income, etc.) expressed in numbers. Still others think of the
term statistics as a subject or a body of knowledge, like other sciences.
Even though the word has different meanings for different individuals, in common usage
“statistics” has two meanings. In one sense, “statistics” is the plural form of “statistic”
and refers to numerical facts and figures collected for a certain purpose. In the other sense,
“statistics” refers to a field of study, a body of knowledge, or a subject concerned
with the systematic collection and interpretation of numerical data in order to make decisions. In this sense
the word statistics is singular.
Classification of Statistics:
Statistical techniques can be applied to virtually every branch of science and art. Because these
techniques are so diverse, statisticians commonly classify them into two broad
categories: descriptive statistics and inferential statistics.
Descriptive Statistics:
It is an area of statistics which is mainly concerned with the methods and techniques
used in the collection, organization, presentation, and analysis of a set of data without drawing
any conclusions or inferences.
According to the above definition, the activities in the area of “Descriptive Statistics” include:
• Gathering data.
• Recording a student’s grades throughout the semester and then finding the average of
these grades.
• Reporting that 40% of the employees in a sample expressed a positive attitude toward the management of
the organization.
• Drawing graphs that show the difference in the scores of males and females.
All the above examples simply summarize and describe a given set of data. Nothing is inferred
or concluded on the basis of the above description.
Inferential Statistics:
Inferential statistics is an area of statistics which deals with methods of inferring or
drawing conclusions about the characteristics of a population based upon the results of a
sample.
Statistics is concerned not only with the collection, organization, presentation and analysis of
data but also with the inferences which can be made after the analysis is completed. When
collecting data on the characteristics of a set of elements, the set may be very large, or
even infinite. Instead of observing the entire set of objects, called the population, one observes a
subset of the population called a sample. Hence, inferential statistics uses sample data to make
decisions about the entire data set.
Examples:
(b) “There is a definite relationship between smoking and lung cancer.” This statement is the
result of continuous research on many samples taken and studied. Therefore, it is an
inference made from sample results.
(c) As a result of the recent reduction in oil production by oil-producing nations, we can
expect the price of gasoline to double in the next year. (This is also an example of
inference from a sample survey.)
The following basic statistical terms are used frequently in this study.
Population: A population is the totality of things, objects, people, etc. about which information is
being collected. It is the totality of observations with which the researcher is concerned.
Census survey: It is the process of examining the entire population. It is the total count of the
population.
Sample: A sample is a subset or part of a population selected to draw conclusions about the
population.
Sampling Frame: It is a list of people, items or units from which the sample is taken.
Parameter: It is a descriptive measure (value) computed from the population. It is the population
measurement used to describe the population.
Statistic: It is a measure used to describe the sample. It is a value computed from the sample.
Data: It refers to a collection of related facts and figures from which conclusions may be
drawn.
Variable: A characteristic which changes from object to object or from time to time.
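The difference between a parameter and a statistic can be illustrated with a short Python sketch. The population values below are hypothetical (generated at random, not taken from any data set in this chapter), and the sample size of 10 and the random seed are arbitrary assumptions chosen only for illustration.

```python
import random
import statistics

# Hypothetical population: made-up CGPAs of all 50 students in a class.
random.seed(1)  # fixed seed so the sketch is repeatable
population = [round(random.uniform(2.0, 4.0), 2) for _ in range(50)]

# Parameter: a measure computed from the whole population (a census).
population_mean = statistics.mean(population)

# Statistic: the same measure computed from a sample of the population.
sample = random.sample(population, 10)
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is why conclusions drawn from a sample are inferences rather than exact statements about the population.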
Note: A census survey (studying the whole population without considering samples) requires a
great deal of time, money and energy. Trying to study the entire population is in most cases
technically and economically not feasible. Hence, we usually take a representative sample
from the population, on the basis of which we draw conclusions about the entire population;
this is called a sample survey. A sample survey therefore has the following merits:
The scope of statistics is indeed very vast. Apart from helping to elicit an intelligent assessment
from a body of figures and facts, statistics is an indispensable tool for any scientific enquiry, right
from the stage of planning the enquiry to the stage of conclusion. It applies to almost all sciences: pure
and applied, physical, natural, biological, medical, agricultural and engineering. It also finds
applications in social and management sciences, in commerce, business and industry.
Of the social sciences, economics leans most heavily on statistical methods for analyses of data
relating to micro- as well as macroeconomics, from demand analyses up to national income
analyses.
Today the field of statistics is recognized as a highly useful tool for decision making by
managers in modern business and industry, where technology changes frequently. It has many functions
in everyday activities. The following are some of the most important uses of statistics.
• Statistics condenses and summarizes complex data. The original set of data (raw data) is
normally voluminous and disorganized unless it is summarized and expressed in a few
numerical values.
• Statistics facilitates comparison of data. Measures obtained from different sets of data
can be compared to draw conclusions about those sets. Statistical values such as averages,
percentages, ratios, etc. are the tools that can be used for the purpose of comparing sets of
data.
• Statistics helps in predicting future trends. Statistics is extremely useful for analyzing
past and present data and predicting future trends.
• Statistics influences the policies of government. The results of statistical studies in areas such as
taxation, the unemployment rate, the performance of every sort of military equipment and
family planning may convince a government to review its policies and plans with a
view to meeting national needs and aspirations.
• Statistical methods are very helpful in formulating and testing hypotheses and in developing
new theories.
Even though statistics is widely used in various fields of natural and social sciences and engineering
that are closely related to human life, it has its own limitations as far as its
application is concerned. Some of these limitations are the following:
• Statistics doesn’t deal with single (individual) values. Statistics deals only with
aggregate values. But in some situations a single individual is highly important to
consider. Examples: the sun, a bus driver, a president, etc.
• Statistics can’t deal with qualitative characteristics. It only deals with data which can be
quantified. For example, it does not deal with marital status (married, single, divorced,
widowed) as such, but it deals with the number of married, single and divorced people.
• Statistical conclusions are not universally true. Statistical conclusions are true only
under certain conditions or true only on average. The conclusions drawn from the analysis
of the sample may differ from the conclusions that would be drawn from the
entire population. For this reason, statistics is not an exact science.
Example: Assume that there are 50 students in your class. Take the CGPA
of all 50 students and compute the mean CGPA; suppose it is 3.00. This value is an
average, because not every individual has a CGPA of 3.00. There are students who scored
above 3.00 and students who scored below 3.00.
• Statistics can be misused. Sometimes statistical figures can be misleading unless they are
carefully interpreted.
Before we deal with statistical investigation, let us see what statistical data mean. Not every
set of numerical data can be considered statistical data; it must satisfy the following criteria.
These are:
Formulating the problem: Research must start from a problem. At this stage the
investigator must be sure to understand the problem and then formulate it in statistical terms.
Clarify the objectives very carefully. Ask as many questions as necessary because “An
approximate answer to the right question is worth a great deal more than a precise answer to the
wrong question”.
Therefore,
• Get a clear understanding of the physical background to the situation under study.
Data Collection: This is the stage where we gather information for the intended purpose.
• Data may be collected by the investigator directly using methods like interviews,
questionnaires and observation, or it may be taken from published or unpublished sources.
Data Organization: This is the stage where we edit our data. A large mass of figures
collected from surveys frequently needs organization. The collected data may contain irrelevant
figures, incorrect facts, omissions and mistakes. Errors that may have been introduced during
collection have to be edited out. After editing, we may classify (arrange) the data according to their
common characteristics. Classification or arrangement of data in some suitable order makes the
information easy to present.
Data Presentation: The organized data can now be presented in the form of tables and diagrams.
At this stage, large data sets are presented in tables in a summarized and condensed manner.
The main purpose of data presentation is to facilitate statistical analysis. Graphs and diagrams
may also be used to give the data a vivid meaning and make the presentation attractive.
Data Analysis: This is the stage where we critically study the data to draw conclusions about the
population parameter. The purpose of data analysis is to dig out information useful for decision
making. Analysis often involves complex and sophisticated mathematical techniques.
However, in this study only the most commonly used methods of statistical analysis are included,
such as averages, the main measures of dispersion, and regression and correlation
analysis.
Data Interpretation: This is the stage where we draw valid conclusions from the results obtained
through data analysis. Interpretation means drawing conclusions from the data which form the
basis for decision making. The interpretation of data is a difficult task and requires a high
degree of skill and experience. If data that have been analyzed are not properly interpreted, the
whole purpose of the investigation may be defeated and fallacious conclusions may be drawn.
Great care is therefore needed when interpreting results.
Variables and Attributes: A variable in statistics is any characteristic, which can take on
different values when data are collected. Conventionally, the quantitative variables are termed
as variables and qualitative variables are termed as attributes.
Qualitative Variables: variables that can be placed into distinct categories according to some
characteristic or attribute.
Quantitative Variables: variables that are numeric in nature and can be ordered or ranked.
Discrete random variables are variables which assume values that can be counted (the values
are always whole numbers). The values are obtained by counting.
Example: Variables such as the number of students, the number of errors per page, the number of
accidents on a traffic line, and the number of defective or non-defective items produced on a production line.
Continuous random variables are quantitative variables which can assume any value between two
specific values (including decimal values). They are obtained by measuring.
Example: Age, time, height, income, price, temperature, length, volume, rate, amount of
rainfall, etc. are continuous variables.
Measurement scales:
Normally, when people hear the term measurement, they think in terms of measuring the
length of something (e.g. the length of a piece of wood) or measuring a quantity of something
(e.g. a cup of flour). This represents a limited use of the term measurement. In statistics, the term
measurement is used more broadly and is more appropriately termed scales of measurement
(measurement scales). Scales of measurement refer to the ways in which variables or numbers are
defined and categorized. Each scale of measurement has certain properties which in turn
determine the appropriateness of certain statistical analyses. The four scales of
measurement are nominal, ordinal, interval, and ratio.
The various measurement scales result from the fact that measurement may be carried out
under different sets of rules.
Nominal Scale: Consists of ‘naming’ observations or classifying them into various mutually
exclusive categories. Sometimes the variable under study is classified by some quality it
possesses rather than by an amount or quantity. In such cases, the variable is called an attribute.
Example: Religion (Christianity, Islam, Hinduism, etc.); Sex (Male, Female);
Eye color (brown, black); Blood type (A, B, AB and O), etc.
Ordinal Scale: Used whenever observations are not only different from category to category, but
can also be ranked according to some criterion. The variables deal with relative differences
rather than with quantitative differences.
Ordinal data are data which can have meaningful inequalities. The inequality signs < or > may
assume any meaning like ‘stronger, softer, weaker, better than’, etc.
Examples:
• Patients may be characterized as unimproved, improved and much improved.
• Letter grading system, authority, career, etc.
CHAPTER TWO
Advantages:
• Can be used as a method in its own right or as a basis for interviewing or a telephone
survey.
• Can be posted, e-mailed or faxed.
• Can cover a large number of people or organizations.
• Wide geographic coverage.
• Relatively cheap.
• No prior arrangements are needed.
• Avoids embarrassment on the part of the respondent.
• Respondent can consider responses.
• Possible anonymity of respondent.
• No interviewer bias.
Disadvantages:
• Historically low response rate (although inducements may help).
• Time delay whilst waiting for responses to be returned.
• Require a return deadline.
• Several reminders may be required.
• Assumes no literacy problems.
• No control over who completes it.
• Not possible to give assistance if required.
• Replies not spontaneous and independent of each other.
• Respondent can read all questions beforehand and then decide whether to complete or
not. For example, perhaps because it is too long, too complex, uninteresting, or too
personal.
Mailed Questionnaire Method: In the mailed questionnaire method, a questionnaire in the form
of a set of questions is sent by mail to the informants. They are expected to answer the questions,
supply any additional information needed, and mail the questionnaire back to the
investigator. This method should be used on the following occasions:
i. When the area under investigation is wide.
ii. When the informants are educated.
iii. When the informants are expected to leave for faraway places.
Schedule through Enumerators: Let us first make a distinction between a questionnaire and
a schedule. A questionnaire is a set of questions whose answers are recorded by the
informant himself or herself, whereas in a schedule the answers are recorded by the investigator or an enumerator
on the informant's behalf.
In this method the investigators or enumerators approach the informants with a prepared
questionnaire and get the replies to the questions. This method is generally used in censuses and
large scale surveys. In the case of a census, investigators visit every member of the source of
information in the zones, while in the case of a sample survey they collect information from those
members who have been selected in the sample.
Interviewing:
It is a technique that is primarily used to gain an understanding of the underlying reasons and
motivations for people’s attitudes, preferences or behavior. Interviews can be undertaken on a
personal one-to-one basis or in a group. They can be conducted at work, at home, in the street or
in a shopping center, or some other agreed location.
Advantages:
• Serious approach by respondent resulting in accurate information.
• Good response rate.
• Completed and immediate.
• Possible in-depth questions.
• Interviewer in control and can give help if there is a problem.
• Can investigate motives and feelings.
• Can use recording equipment.
• Characteristics of respondent assessed – tone of voice, facial expression, hesitation, etc.
• If one interviewer used, uniformity of approach.
• Used to pilot other methods.
Disadvantages:
• Need to set up interviews.
• Time consuming.
• Geographic limitations.
• Can be expensive.
• Normally need a set of questions.
• Respondent bias – tendency to please or impress, create false personal image, or end
interview quickly.
• Embarrassment possible if personal questions.
• Transcription and analysis can present problems– subjectivity.
• If many interviewers, training required.
Type of Tables
Tables can be classified according to their purpose, stage of enquiry, nature of data or number of
characteristics used. On the basis of the number of characteristics, tables may be classified as
follows:
1. Simple or one-way table
2. Two-way table or contingency table
3. Manifold table or higher order table
Simple or One-Way Table
A simple or one-way table is the simplest table, which contains data on one characteristic only. A
simple table is easy to construct and simple to follow. For example, the adjacent blank table may
be used to show the number of adults in different occupations in a locality.
The number of adults in different occupations in a locality
Occupations No. of adults
Employee
Farmer
Total
Two-Way Table
A table which contains data on two characteristics is called a two-way table. In such a case,
either the stub or the caption is divided into two co-ordinate parts. In the given table, as an
example, the caption may be further divided in respect of sex.
This sub-division is shown in the adjacent two-way table, which now contains two characteristics,
namely occupation and sex.
Table: The number of adults in a locality in respect of occupation and sex
No. of Adults
Occupation Male Female Total
Employee
Farmer
Total
Manifold (higher order) Table
Including further characteristics produces more and more complex tables. For example, we
may further classify the caption sub-headings in the above table in respect of ‘marital status’,
‘socio-economic status’, etc. A table in which more than two characteristics of data are
considered is called a manifold table. For instance, the table below shows three characteristics,
namely occupation, sex and marital status.
Table: The number of adults in a locality in respect of occupation, sex and marital status
                          No. of Adults
Occupations    Male                 Female               Total
               M     U     Total    M     U     Total
Employee
Farmer
Total
Footnote: M stands for Married and U for Unmarried.
Manifold tables, though complex, are good in practice as they enable full information to be
incorporated and facilitate an analysis of all required facts. Still, as a normal practice, not more
than four characteristics should be represented in one table to avoid confusion. Other related
tables may be formed to show the remaining characteristics.
• Frequency Distributions: Qualitative, Quantitative: Absolute, Relative and Percentage.
The Frequency Distribution
Frequency is the number of times a certain value, group of values or category (quality) is
repeated in a given set of data. A frequency distribution is the organization of raw data in table
form, using classes and frequencies.
Example 2.1: Last year, blood tests were taken from thirty persons and the following blood groups
were obtained. Construct an appropriate frequency distribution for these data.
B B AB B A AB
O AB AB AB B B
B A B AB O AB
A O B O AB A
B AB AB A AB O
There are four kinds of blood groups: A, B, AB, and O, which may be used as the classes for
constructing the distribution. The procedure for constructing a frequency distribution for
categorical data is given below.
Frequency distribution of blood type
Blood Type    Tally          Frequency
A             ////           5
B             //// ////      9
AB            //// //// /    11
O             ////           5
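The tallying for Example 2.1 can also be checked by machine. The following is a minimal Python sketch that counts the thirty recorded blood groups; the variable names are illustrative choices, not part of the original notes.

```python
from collections import Counter

# The thirty observations of Example 2.1, row by row.
blood_groups = """
B B AB B A AB
O AB AB AB B B
B A B AB O AB
A O B O AB A
B AB AB A AB O
""".split()

# Counter tallies how many times each category occurs.
frequency = Counter(blood_groups)

for blood_type in ["A", "B", "AB", "O"]:
    print(f"{blood_type:<3} {frequency[blood_type]}")
```

Each category's count is its frequency, and the counts add up to the number of observations.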
Example 2.3: The following data are the serum triglyceride levels (mg/dl) of 30 male Biology
students whose blood was measured last year: 30, 41, 39, 41, 32, 29, 35, 31, 30, 36, 33, 36, 32, 42,
30, 35, 37, 32, 30, and 41. Construct an appropriate frequency distribution for these data.
Example 2.4.
A demographer is interested in the number of children a family may have and took a sample of
30 families and obtained the following observations.
No of children 2 3 4 5 6 7 8 Total
No of family (frequency) 5 7 8 4 1 2 3 30
• Upper Class Limits are the largest numbers that can belong to the different classes.
• Units of measurement (U): The distance between two possible consecutive measures. It is
usually taken as 1, 0.1, 0.01, 0.001, etc.
• Class boundaries (true class limits): Separate one class in a grouped frequency distribution
from another. The boundaries have one more decimal place than the raw data and therefore
do not appear in the collected data. There is no gap between the upper boundary of one class
and the lower boundary of the next class. The lower class boundary (LCB) is found by
subtracting U/2 from the corresponding lower class limit (LCL) and the upper class boundary
(UCB) is found by adding U/2 to the corresponding upper class limit (UCL).
• Class width (W): the difference between the upper and lower boundaries of any class or the
lower limits of two consecutive classes, or the upper limits of two consecutive classes.
N.B. Class width is not equal to the difference between UCL and LCL of the same class.
• Class Mid-point (class mark): When we add up the lower and the upper
class limits of a class interval and divide the sum by two, we get the
class mid-point. Thus, the mid-point of the class interval 40-60 is (40+60)/2 = 50.
The formula for obtaining the class mid-point is as follows:
Mid-point (mi) = (lower class limit + upper class limit)/2
As we shall see subsequently, the mid-point of each class interval is taken to represent it for the
purpose of statistical calculations.
• Cumulative frequency (C. f) less than type: the total frequency of all values (observations) less
than or equal to the upper class boundary for the given class.
• Cumulative frequency (C f) more than type: The total frequency of all values (observations)
greater than or equal to the lower class boundary for the given class. A tabular arrangement of
class intervals together with their corresponding cumulative frequency (either less than or
more than type; as defined above) is called a cumulative frequency distribution.
• Relative frequency: the frequency of a class divided by the total frequency (i.e. the sum of all
frequencies). If multiplied by 100, it gives the percent of values falling in that class.
Note: 1. The relative frequency shows what fractional part or proportion of the total frequency
belongs to the corresponding class.
• The sum of all the relative frequencies in the frequency distribution is always 1.
• Relative cumulative frequency (less than type/more than type): the total of the relative frequencies
of all classes up to and including the given class (less than type), or from the given class onwards
(more than type). Equivalently, it is the cumulative frequency (less than type/more than type)
divided by the total frequency. It gives the proportion of values which are less than the upper
class boundary, or more than the lower class boundary, of the class.
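The definitions above can be put together in a short Python sketch. The class limits and frequencies below are hypothetical values chosen only for illustration, and the unit of measurement is assumed to be U = 1 (whole-number data).

```python
import math

unit = 1  # unit of measurement U (assumed: data recorded as whole numbers)

# Hypothetical class limits (LCL, UCL) and frequencies, for illustration only.
class_limits = [(6, 11), (12, 17), (18, 23)]
frequencies = [2, 5, 3]

total = sum(frequencies)
rows = []
for (lcl, ucl), f in zip(class_limits, frequencies):
    lcb = lcl - unit / 2        # lower class boundary = LCL - U/2
    ucb = ucl + unit / 2        # upper class boundary = UCL + U/2
    width = ucb - lcb           # class width W
    midpoint = (lcl + ucl) / 2  # class mid-point (class mark)
    rel_freq = f / total        # relative frequency
    rows.append((lcb, ucb, width, midpoint, rel_freq))

for row in rows:
    print(row)

# The relative frequencies should add up to 1 (up to rounding error).
print(math.isclose(sum(row[4] for row in rows), 1.0))
```

Note that the width of the first class is 6, while UCL - LCL is only 5, which illustrates the N.B. on class width.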
• Find the maximum (Max) and the minimum (Min) observations, and then compute their
range, R = Max - Min.
• Fix the number of classes desired (k). There are two ways to fix k:
• Fix k arbitrarily between 5 and 20, or
Example 2.5: A sample of 20 fish was taken at random from a fish pond. The oxygen
consumption of each fish was measured and recorded as follows. Construct a frequency distribution for these
data.
11 29 6 33 14 31 22 27 19 20
18 17 22 38 23 21 26 34 39 27
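One possible way to work through Example 2.5 is sketched below in Python. It assumes that the number of classes is fixed by Sturges' rule in its common form, k = 1 + 3.322 log10(n), and that both k and the class width are rounded up to whole numbers; these rounding choices are assumptions made for this sketch, and other conventions give different classes.

```python
import math

oxygen = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
          18, 17, 22, 38, 23, 21, 26, 34, 39, 27]

n = len(oxygen)
data_range = max(oxygen) - min(oxygen)      # R = Max - Min
k = math.ceil(1 + 3.322 * math.log10(n))    # Sturges' rule, rounded up (assumption)
width = math.ceil(data_range / k)           # class width W, rounded up (assumption)
unit = 1                                    # U = 1 because the data are whole numbers

distribution = []
lower = min(oxygen)                         # first lower class limit
for _ in range(k):
    upper = lower + width - unit            # upper class limit of this class
    frequency = sum(lower <= x <= upper for x in oxygen)
    distribution.append((lower, upper, frequency))
    lower += width                          # next lower class limit

for lcl, ucl, frequency in distribution:
    print(f"{lcl:>2} - {ucl:<2}  {frequency}")
```

Under these assumptions the sketch produces six classes of width 6, starting at the smallest observation, and the class frequencies add up to 20.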
Example: The following data are the ages of 20 women who attended health education in a certain
hospital. Construct a frequency distribution (with relative frequency and percentage distributions)
using Sturges' rule: 30, 25, 23, 41, 39, 27, 41, 24, 32, 29, 35, 31, 36, 33, 36, 42, 35, 37, 41, and 29.
Example: Construct both “less than” and “more than” cumulative frequency distributions for
the following data on the weekly wage distribution of 201 workers.
Weekly wage 0-20 20-40 40-60 60-80 80-100
No of workers 41 51 64 38 7
Table: Cumulative Frequency Distribution
Weekly wage more than    No. of workers    Weekly wage less than    No. of workers
0                        201               20                       41
20                       160               40                       92
40                       109               60                       156
60                       45                80                       194
80                       7                 100                      201
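Both cumulative frequency columns can be computed directly from the class frequencies of the weekly wage example (41, 51, 64, 38 and 7 workers). The following Python sketch shows one way of doing so; the variable names are illustrative.

```python
from itertools import accumulate

boundaries = [0, 20, 40, 60, 80, 100]   # class boundaries of the weekly wage
frequencies = [41, 51, 64, 38, 7]       # number of workers in each class

# "Less than" type: running total from the first class to the last.
less_than = list(accumulate(frequencies))

# "More than" type: running total from the last class back to the first.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for upper, cf in zip(boundaries[1:], less_than):
    print(f"less than {upper:>3}: {cf}")
for lower, cf in zip(boundaries[:-1], more_than):
    print(f"more than {lower:>3}: {cf}")
```

The last "less than" value and the first "more than" value both equal the total frequency, 201, which is a quick check on the arithmetic.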
2.2.4 Diagrammatic Presentation of Data
Grouped data or tables with many figures do not always appeal to the general reader, as too
many figures are generally confusing and fail to convey the pattern or trend of the
data. Diagrammatic presentation of data refers to techniques for presenting data in visual
displays using geometry and pictures.
They are important because
• They have greater attraction.
• They facilitate comparison.
• They are easily understandable.
Diagrams are appropriate for presenting discrete data. The three most commonly used
diagrammatic presentations for discrete as well as qualitative data are pie charts, pictograms and
bar charts.
Pie chart
A pie chart is a circle that is divided into sections or sectors according to the percentage of
frequencies in each category of the distribution.
The angle of each sector is calculated by the formula:
Angle of sector = (value of the component / total value) × 360°
These angles are marked in the circle by means of a protractor to show the different components. The
arrangement of the sectors is usually anti-clockwise.
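As an illustration, the sector angles for the blood-type frequencies of Example 2.1 (A = 5, B = 9, AB = 11, O = 5) can be computed with the usual proportional rule, angle = (component / total) × 360 degrees. This Python sketch only computes the angles; it does not draw the chart.

```python
# Blood-type frequencies from Example 2.1.
frequencies = {"A": 5, "B": 9, "AB": 11, "O": 5}

total = sum(frequencies.values())

# Each sector's angle is proportional to its share of the total.
angles = {category: value * 360 / total
          for category, value in frequencies.items()}

for category, angle in angles.items():
    print(f"{category:<3} {angle:6.1f} degrees")
```

The angles always add up to 360 degrees, which is a useful check before drawing the sectors with a protractor.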
Example 2.6: The following table gives the details of the monthly budget of a family. Represent
these figures by a suitable diagram.
Pictogram
In these diagrams, we represent data by means of picture symbols. We choose a
suitable picture to represent a definite number of units in which the variable is measured. The
following table shows the orange production of a plantation for the production years 1990 to 1993.
Table: Orange productions from 1990 to 1993
Year Amount in Kg
1990 3000
1991 3850
1992 3500
1993 5000
Bar Charts:
A bar chart is a set of bars (thick lines or narrow rectangles) representing some magnitude over time or space.
Bar charts are useful for comparing aggregates over time or space. Bars can be drawn either vertically or
horizontally. Usually horizontal bar diagrams are used for qualitatively classified data, whereas
vertical bar diagrams are used for quantitatively classified data.
EXAMPLE: Horizontal bar diagram of the blood groups of a group of persons
Blood type frequency
A 9
B 14
AB 10
O 17
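A rough horizontal bar diagram for this table can be produced even without plotting software. The Python sketch below prints one bar of "#" symbols per blood type; the function name and the choice of symbol are illustrative assumptions, not a standard charting method.

```python
# Blood-type frequencies from the table above.
blood_type_frequency = {"A": 9, "B": 14, "AB": 10, "O": 17}


def horizontal_bar_chart(frequencies, symbol="#"):
    """Return one text line per category; bar length equals the frequency."""
    lines = []
    for category, count in frequencies.items():
        lines.append(f"{category:<3}| {symbol * count} {count}")
    return lines


chart = horizontal_bar_chart(blood_type_frequency)
for line in chart:
    print(line)
```

Because the bar lengths are proportional to the frequencies, the longest bar (blood type O) can be read off at a glance.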
• Component Bar-diagrams
When it is desired to show how a total (an aggregate) is divided into component parts, we use
component bar diagram. In such type of bar-diagrams, the bars represent aggregate value of a
variable with each aggregate broken into its component parts and different colors or designs are
used for identification.
• Multiple bar-diagrams
Multiple bar-diagrams are used to display data on more than one variable. They are used for
comparing different variables at the same time.
Frequency Graphs
The types of frequency graphs normally used are the histogram, the frequency polygon, and the ogive.
Histogram: It is a special type of bar graph in which the horizontal scale represents classes of data
values and the vertical scale represents frequencies. The heights of the bars correspond to the frequency
values, and the bars are drawn adjacent to each other (without gaps). A histogram can be constructed after we
have first completed a frequency distribution table for a data set. The x axis is reserved for the class
boundaries.
Example: Construct a histogram for the frequency distribution of the time spent by the
automobile workers.
Table: Time in minutes spent by automobile workers
Time (in minutes)    Class mark    Number of workers
15.5-21.5            18.5          3
21.5-27.5            24.5          6
27.5-33.5            30.5          8
33.5-39.5            36.5          4
39.5-45.5            42.5          3
45.5-51.5            48.5          1
Figure: Histogram representing time in minutes spent by auto workers
A relative frequency histogram has the same shape and horizontal (x) scale as a histogram, but the vertical (y)
scale is marked with relative frequencies instead of actual frequencies.
Frequency Polygon
A frequency polygon is a line graph drawn by taking the frequencies of the classes along the vertical
axis and their respective class marks (midpoints) along the horizontal axis. Points are located directly
above the class midpoints at heights equal to the class frequencies and are joined by line segments.
The line segments are extended to the left and right so that the graph begins and ends on the horizontal
axis, at the positions where the previous and next class midpoints would be located.
Example: Draw a frequency polygon presenting the following data.
Class boundaries    Frequency    c.f. (less than type)    c.f. (more than type)
5.5 – 11.5          2            2                        20
11.5 – 17.5         2            4                        18
17.5 – 23.5         7            11                       16
23.5 – 29.5         4            15                       9
29.5 – 35.5         3            18                       5
35.5 – 41.5         2            20                       2
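The points to be plotted for the frequency polygon of this example can be listed with a short Python sketch. It assumes equal class widths, so that the two closing points on the horizontal axis lie one class width beyond the first and last midpoints.

```python
# Class boundaries and frequencies from the table above.
boundaries = [5.5, 11.5, 17.5, 23.5, 29.5, 35.5, 41.5]
frequencies = [2, 2, 7, 4, 3, 2]

# Class marks: midpoint of each pair of consecutive boundaries.
midpoints = [(low + high) / 2 for low, high in zip(boundaries, boundaries[1:])]
width = boundaries[1] - boundaries[0]   # common class width (assumed equal)

# Close the polygon on the horizontal axis with zero-frequency end points.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]
points = list(zip(xs, ys))

for x, y in points:
    print(f"({x}, {y})")
```

Joining these points in order with straight line segments gives the frequency polygon.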
ii. Cumulative Frequency Polygon (Ogive)
A cumulative frequency polygon can be drawn on a less than or a more than cumulative frequency basis.
Place the class boundaries along the horizontal axis and the corresponding cumulative frequencies
(either less than or more than type) along the vertical axis. Then join the plotted
points by a free hand curve.
Example: The data in the previous example can be presented using either a less than or a more than
cumulative frequency polygon, as given below in (i) and (ii) respectively.