Chapter 1
Introduction to statistics
Statistics is defined as the collection, processing, interpretation, and presentation of data.
The origins of statistics can be traced to two areas: (a) government and (b) games of
chance.
• Governments have long used censuses to count persons and properties for different
purposes including taxation, listing economic resources, etc.
• Games of chance date back thousands of years. The earliest known use of dice
was in Egypt about 3500 B.C. However, the mathematical study of such
games began less than four centuries ago. In 1654, Blaise Pascal and Pierre de Fermat
(two mathematicians) took up a gambling problem and solved it independently
using different approaches. Until then, the mathematics of games of chance (now
known as probability theory) had been overlooked. Nowadays, the theory of
probability is applied to many problems in the social and physical sciences, for
example to measure the uncertainty or risk of an event.
i. Primary data: This is raw data collected first-hand by the researcher (e.g. an
organisation, person, authority or agency) through experiments, surveys, focus
groups, interviews and questionnaires.
ii. Secondary data: This is readily available data that was collected by someone else.
It is available to the public through publications, journals and newspapers.
Generally, primary sources are preferred to secondary sources because the
possibility of transcription errors is reduced. Primary sources are also usually
accompanied by documentation and precise definitions.
There are two types of statistical data: quantitative (or numerical) and qualitative (or
categorical) data.
There are four scales of measurement for statistical data which include nominal, ordinal,
interval and ratio scales.
Nominal Data: There exists no natural ranking or ordering in the data. For example, political
affiliations (UDC, BMD, AP, BDP), gender (Female/Male), etc.
Ordinal Data: The data have a natural order, but there is no precise mathematical difference
between levels. For example, heat (low, medium, high), movie rating (1-star, 2-star), etc.
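The distinction between unordered and ordered categories can be sketched in Python. This is only an illustration: the sample responses and the names `HEAT_RANK`, `affiliations` and `heat_levels` are hypothetical and do not come from the text.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data only).
affiliations = ["BDP", "UDC", "AP", "BDP", "BMD", "UDC", "BDP"]  # no natural order
heat_levels = ["high", "low", "medium", "low", "high"]            # ordered levels

# Unordered categories: counting is meaningful, ranking is not.
affiliation_counts = Counter(affiliations)
most_common_affiliation = affiliation_counts.most_common(1)[0][0]

# Ordered categories: an explicit rank encodes the order of the levels,
# but the gaps between ranks carry no precise numerical meaning.
HEAT_RANK = {"low": 0, "medium": 1, "high": 2}
sorted_heat = sorted(heat_levels, key=HEAT_RANK.get)
highest_heat = max(heat_levels, key=HEAT_RANK.get)

print(most_common_affiliation)
print(sorted_heat)
print(highest_heat)
```

Sorting or taking a maximum makes sense for the heat levels because they have an order; for the political affiliations only counts (and the most frequent category) are meaningful.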
Interval Data: Interval data satisfies the following conditions:
• Intervals of equal length signify equal differences in the characteristic. For example,
the difference between 10 °C and 20 °C is the same as the difference between 90 °C and
100 °C.
• Differences make sense, but ratios do not. For example, 100 °C is not twice as hot as
50 °C.
• The scale does not have a ‘true zero’ starting point (i.e. it has an arbitrary zero).
Additionally, zero does not signify an absence of the characteristic, e.g. 0 °C does not
represent the absence of heat.
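A short Python sketch may help show why differences are meaningful on such a scale while ratios are not. The temperatures are chosen for illustration, and the Kelvin conversion is an assumption brought in here (it is not part of the text) because the Kelvin scale has a true zero.

```python
import math

def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin (Kelvin has a true zero; used for comparison only)."""
    return celsius + 273.15

# Equal intervals signify equal differences on the Celsius scale.
diff_low = 20 - 10
diff_high = 100 - 90
same_difference = diff_low == diff_high

# Differences survive a change of zero point (up to floating-point rounding).
diff_low_kelvin = celsius_to_kelvin(20) - celsius_to_kelvin(10)
difference_preserved = math.isclose(diff_low_kelvin, diff_low)

# Ratios do not: the "2x" seen in Celsius depends on the arbitrary zero.
ratio_celsius = 100 / 50
ratio_kelvin = celsius_to_kelvin(100) / celsius_to_kelvin(50)

print(same_difference, difference_preserved)
print(ratio_celsius, round(ratio_kelvin, 3))
```

The ratio changes once the zero point moves, which is why a statement such as "twice as hot" is not supported by Celsius readings alone.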
Ratio Data: Ratio data carries more information than interval data. Ratio data satisfies
the following conditions:
• Both differences and ratios are meaningful. For example, two 2 ml glasses of water
are equivalent to one 4 ml glass of water. We can also say that 4 ml of water is
twice as much as 2 ml of water.
• The scale has a ‘true zero’ starting point. For example, 0 ml of water is a ‘true
zero’ because the glass is empty, which means an absence of water.
Data types
Data type is an important concept because each statistical method can only be used with
certain data types. For instance, you cannot analyse continuous data the same way
categorical data is analysed; the results would be wrong. Therefore, it is important to
know the data type in order to choose the correct method of analysis.
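As a rough illustration of matching the method to the data type, the sketch below uses Python's standard `statistics` module. The datasets and variable names are hypothetical; the point is only that a mean suits numeric data, whereas counts and the mode suit categorical data.

```python
import statistics
from collections import Counter

# Hypothetical datasets (illustrative only).
students_per_class = [3, 0, 2, 5, 2]               # numeric counts
response_times_min = [4.2, 3.8, 5.1, 4.6]          # numeric measurements
genders = ["Female", "Male", "Female", "Female"]   # categorical labels

# Numeric data: an arithmetic mean is a sensible summary.
mean_students = statistics.mean(students_per_class)
mean_response = statistics.mean(response_times_min)

# Categorical data: summarise with frequencies and the mode instead.
gender_counts = Counter(genders)
gender_mode = statistics.mode(genders)

# Asking for the mean of category labels is not a valid analysis;
# the statistics module rejects it with an error.
try:
    statistics.mean(genders)
    mean_of_labels_failed = False
except (TypeError, ValueError):
    mean_of_labels_failed = True

print(mean_students, gender_mode, dict(gender_counts), mean_of_labels_failed)
```

Here the library refuses the invalid request outright. In practice, categories are often stored as numeric codes, in which case a mean would be computed without complaint and the result would simply be meaningless.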
Progression check
3. The total number of false alarms reported in a week is which type of data?
(a) continuous
(b) ordinal
(c) discrete numeric
(d) nominal