LU1 Lecture Notes
Learning objectives
• Understand the concepts of a population, sample, parameter, statistic, random variable
and data
• Distinguish between descriptive and inferential statistics
• Identify data types and measurement scales
• Know the difference between raw data and frequency data
• Understand sigma notation
Textbook reference
• Chapter 1
o §1.1 – §1.3, §1.7
o Exclude §1.4 – §1.6, §1.8 – §1.11
ATE01A1 – LU 1 1
INTRODUCTION
An essential part of the scientific research process is gathering, ordering, and analysing
information from which conclusions can be drawn and interpretations can be made. The
study of statistical methods focuses on how the data should be analysed so that meaningful
conclusions can be drawn.
To perform statistical analyses, one must first understand the language of statistics. In this
section, we will define basic statistical terminology and concepts.
Population
Parameter
Sample
A sample is a subset of the population of interest. Samples are generally used to collect
information since considering the entire population is not always possible or feasible.
Statistic
A number calculated from sample data, which describes a measurable aspect of a sample,
is called a statistic. Sample statistics are generally denoted using Roman letters.
Sampling unit
A sampling unit is an object being measured, counted, or observed.
Random variable
A variable is a characteristic of the elements of a population (or sample) for which the
observed values differ from element to element.
• In probability theory, where a variable assumes specific values with certain associated
probabilities, the variable is called a random variable.
• Variables are denoted by capital letters, e.g. X, Y, Z, and the values assumed by the
random variables are denoted by lowercase letters, e.g. x, y, z.
For example, let X = the height of boys in metres. Here, X is a random variable that
measures the variable “height”. If three boys are selected at random, i.e. n = 3, and their
respective heights are 1.40 m, 1.37 m and 1.41 m, the realisations of the random variable X
are given by $x_i$ for i = 1, 2, 3:
$x_1 = 1.40$, $x_2 = 1.37$, $x_3 = 1.41$
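As a minimal sketch (Python is used purely for illustration, and the variable names are assumptions rather than part of the course material), the three realisations can be stored in a list. The sample mean is then calculated from them as one example of a statistic:

```python
# Realisations x_1, x_2, x_3 of the random variable X (height of boys in metres)
heights = [1.40, 1.37, 1.41]

n = len(heights)                 # sample size, n = 3
sample_mean = sum(heights) / n   # a statistic: a number calculated from sample data

print(f"n = {n}")
print(f"sample mean = {sample_mean:.2f} m")
```

The list holds the observed values of X, while the sample mean is a single number that describes the sample.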
Data
Data are the actual values (numbers) or outcomes of all variables measured on the sampling units.
Descriptive statistics
Descriptive statistics comprise those methods used to organise and describe information
that has been collected in a sample.
Inferential statistics
Inferential statistics comprise those methods and techniques used for making
generalisations, predictions or estimates about the population using sampled data.
Notation
ATE01B1 – LU 1 7
Exercise 1.1
Consider the results of the three semester tests for ATE A:
7) To test whether the current group of ATE students performs better than groups from
previous years is the process of:
UNDERSTANDING DATA
Data Types
Determining the most appropriate statistical method depends firstly on the problem
statement to be addressed and secondly on the type of data available. Specific statistical
methods are valid for certain data types only. Data types are identified by the nature of their
random variables. A random variable is categorical (qualitative) or numeric (quantitative).
1. Categorical random variables
Categorical variables are also known as qualitative variables. Such variables allow for
classification based on some characteristic. The variable ‘Eye colour’ can be classified as
brown, blue, green or grey.
The values of categorical variables are often recorded as numerical values. For example,
gender might be coded in the dataset as 1 = Male and 2 = Female, but these values have
no numerical meaning as they denote labels or categories of the variable. Such categorical
data can, therefore, only be counted to determine how many responses belong to each
category.
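The point about coded categories can be sketched in Python. The ten coded responses below are illustrative values only, using the coding 1 = Male and 2 = Female from the text; the variable names are assumptions.

```python
from collections import Counter

# Illustrative coded responses: 1 = Male, 2 = Female
gender_codes = [1, 1, 2, 2, 1, 1, 1, 1, 2, 2]
labels = {1: "Male", 2: "Female"}

# Counting how many responses fall in each category is meaningful
counts = Counter(labels[code] for code in gender_codes)
print(counts["Male"], counts["Female"])

# Arithmetic on the codes runs without error, but the result has no
# interpretation: the codes are labels, not quantities
mean_of_codes = sum(gender_codes) / len(gender_codes)
print(mean_of_codes)
```

Recoding the categories (say, 1 = Female and 2 = Male) would change the mean of the codes but not the counts, which is why only the counts are informative.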
2. Numerical random variables
Numerical variables are also known as quantitative variables. Such variables are naturally
measured as numbers, for example, a person’s height in centimetres. Arithmetic
operations can be performed on the values of these variables as they have numerical meaning.
Measurement Scales
Data can also be classified in terms of their scale of measurement, i.e., the procedure used to
measure or obtain the data. There are four types of measurement scales: nominal, ordinal,
interval, and ratio.
1. Nominal
For example, a person’s eye colour could be brown, blue, green or grey. There is no logical
way in which these four categories can be ordered. Nominal data are, therefore, usually
ordered alphabetically and then assigned a numeric value.
2. Ordinal
For example, a person’s age is classified as 1 = young, 2 = middle-aged, 3 = old. The three
possible values of this variable are ordered logically.
Note: in this example, numbers are used to reflect the measurement in order from low to
high, without any numeric meaning attached to the values.
3. Interval
• there is no true or absolute zero, i.e., the value of zero is an arbitrary reference point
For example, temperature in degrees Celsius. The values are numerical and ordered. A
temperature of 0°C does not mean an absence of temperature, i.e., the scale has an
arbitrary zero value. The difference between 10°C and 20°C is the same as the difference
between 30°C and 40°C, namely a 10-degree difference. However, 20°C is not twice as hot
as 10°C, i.e., ratios are not meaningful.
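One way to see why ratios are not meaningful on an interval scale is to re-express the same two temperatures on the Kelvin scale, which does have an absolute zero. The sketch below is an illustration only; the Kelvin conversion and the function name are additions that do not appear in the notes.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin (K = °C + 273.15)."""
    return celsius + 273.15

t1, t2 = 10.0, 20.0

# Differences are preserved when the zero point is shifted
diff_celsius = t2 - t1
diff_kelvin = celsius_to_kelvin(t2) - celsius_to_kelvin(t1)

# Ratios are not preserved: they depend on where zero is placed
ratio_celsius = t2 / t1                                      # 2.0
ratio_kelvin = celsius_to_kelvin(t2) / celsius_to_kelvin(t1) # about 1.035

print(round(diff_celsius, 2), round(diff_kelvin, 2))
print(round(ratio_celsius, 3), round(ratio_kelvin, 3))
```

The 10-degree difference survives the change of scale, but the apparent "twice as hot" ratio does not, because it is an artefact of the arbitrary zero on the Celsius scale.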
4. Ratio
For example, the amount of money in a bank account in Rand. The values are numerical
and ordered. An amount of R0 implies an absence of money, i.e., the scale has an absolute
zero value. The difference between R10 and R20 is the same as that between R30 and
R40, namely a R10 difference. R20 is twice as much money as R10, i.e., ratios are
meaningful.
Exercise 1.2
Data were collected from a random sample of 20 coffee consumers. The survey yielded the
following variables and data.
| Consumer ID number | Gender | Age | Highest qualification | Household size | Daily coffee consumption | Coffee type preference | Choice of brand rating | Coffee affinity score |
|---|---|---|---|---|---|---|---|---|
| 1 | Male | 24 | Tertiary certificate | 4 | 3 | Instant | 2 | 2.3 |
| 2 | Male | 26 | Degree/Diploma | 2 | 1 | Instant | 1 | 1.9 |
| 3 | Female | 25 | Degree/Diploma | 3 | 2 | Filter | 1 | 0.8 |
| 4 | Female | 30 | Less than matric | 5 | 7 | Instant | 5 | 4.4 |
| 5 | Male | 35 | Tertiary certificate | 1 | 4 | Instant | 3 | 3.1 |
| 6 | Male | 21 | Tertiary certificate | 1 | 1 | Filter | 3 | 0.4 |
| 7 | Male | 24 | Degree/Diploma | 4 | 2 | Instant | 4 | 1.8 |
| 8 | Male | 19 | Matric | 1 | 1 | Filter | 4 | 0.4 |
| 9 | Female | 28 | Postgraduate degree | 2 | 3 | Instant | 2 | 3.1 |
| 10 | Female | 34 | Matric | 3 | 2 | Instant | 1 | 1.9 |
| 11 | Male | 37 | Tertiary certificate | 2 | 5 | Instant | 1 | 4.9 |
| 12 | Female | 40 | Postgraduate degree | 5 | 2 | Filter | 3 | 0.6 |
| 13 | Male | 29 | Degree/Diploma | 4 | 1 | Instant | 1 | 0.1 |
| 14 | Male | 35 | Degree/Diploma | 2 | 4 | Filter | 5 | 3.6 |
| 15 | Female | 29 | Matric | 3 | 1 | Filter | 4 | 1 |
| 16 | Male | 19 | Matric | 6 | 2 | Instant | 4 | 1.4 |
| 17 | Female | 32 | Degree/Diploma | 1 | 3 | Filter | 3 | 2.4 |
| 18 | Male | 19 | Less than matric | 2 | 5 | Instant | 2 | 3.4 |
| 19 | Female | 26 | Tertiary certificate | 5 | 2 | Instant | 3 | 0.2 |
| 20 | Female | 36 | Postgraduate degree | 3 | 8 | Instant | 2 | 4.6 |
Daily coffee consumption = Number of cups
Choice of brand rating: 1 = Not important, 2 = Somewhat important, 3 = Important, 4 = Relatively important, 5 = Very important.
Coffee affinity score: derived from other information (one or more variables were combined) to calculate this new variable.
For each variable, identify the type and the scale of measure.
Data Formats
1. Raw data
Raw data refers to unprocessed information, also known as source data or primary data. All
information collected is first represented in raw data format, i.e., the dataset. As in the
previous example, the dataset is shown in a matrix format with rows and columns.
Variables are given in the columns, and observations are presented in the rows. A sample
of n observations and p variables will yield a dataset with n rows and p columns.
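The row-and-column layout can be sketched in Python as a list of rows. The values are the first four observations of three variables from the coffee survey in Exercise 1.2; the list-of-lists representation and the variable names are assumptions made for illustration.

```python
# Columns (variables) of the small raw dataset
variables = ["Gender", "Age", "Household size"]

# Rows (observations): consumers 1 to 4 from Exercise 1.2
raw_data = [
    ["Male", 24, 4],
    ["Male", 26, 2],
    ["Female", 25, 3],
    ["Female", 30, 5],
]

n = len(raw_data)      # number of observations (rows)
p = len(variables)     # number of variables (columns)

# Every row must hold exactly one value per variable
assert all(len(row) == p for row in raw_data)

print(f"{n} rows x {p} columns")
```

Each inner list is one observation, and each position within it corresponds to one variable, giving n = 4 rows and p = 3 columns.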
The steps to enter raw data into the calculator are as follows:
4) AC
2. Frequency data
Frequency data are raw data in an aggregated format, where individual data values or
ranges of values are listed with a count of the number of times each value or range appears
in the dataset. This count is the frequency of occurrence or simply the frequency. It shows how
the data are distributed across the scale. Frequency data provide an overview of the
sampled information. Univariate frequency data represent counts of a single variable, and
bivariate frequency data represent counts of the combination of two variables. Steps to
enter frequency data into the calculator are given in Learning Unit 2.
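As a rough illustration, the raw Gender and Coffee type preference values of the 20 consumers in Exercise 1.2 can be aggregated into univariate and bivariate frequency data. This is a sketch in Python using the standard-library `Counter`; the variable names are assumptions.

```python
from collections import Counter

# Raw data for consumers 1 to 20 (Exercise 1.2)
gender = [
    "Male", "Male", "Female", "Female", "Male",
    "Male", "Male", "Male", "Female", "Female",
    "Male", "Female", "Male", "Male", "Female",
    "Male", "Female", "Male", "Female", "Female",
]
coffee_type = [
    "Instant", "Instant", "Filter", "Instant", "Instant",
    "Filter", "Instant", "Filter", "Instant", "Instant",
    "Instant", "Filter", "Instant", "Filter", "Filter",
    "Instant", "Filter", "Instant", "Instant", "Instant",
]

# Univariate frequency data: counts of a single variable
univariate = Counter(coffee_type)

# Bivariate frequency data: counts of each (gender, coffee type) combination
bivariate = Counter(zip(gender, coffee_type))

for value, frequency in univariate.items():
    print(value, frequency)
for combination, frequency in bivariate.items():
    print(combination, frequency)
```

The frequencies in each case add up to the sample size of 20, since every observation is counted exactly once.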
Sigma Notation (self-read)
$\sum_{i=1}^{3} x_i = \sum x = x_1 + x_2 + x_3 = 2 + 3 + 5 = 10$
Note: the square of the sum is not equal to the sum of squares, i.e. $\left(\sum x\right)^2 \neq \sum (x^2)$
Example: $\left(\sum x\right)^2 = (x_1 + x_2 + x_3)^2 = (2 + 3 + 5)^2 = (10)^2 = 100$
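The difference between the square of the sum and the sum of squares can be checked numerically with the same three values (2, 3 and 5). This is a minimal Python sketch; the variable names are assumptions.

```python
x = [2, 3, 5]

sum_x = sum(x)                            # 2 + 3 + 5 = 10
square_of_sum = sum_x ** 2                # (10)^2 = 100
sum_of_squares = sum(v ** 2 for v in x)   # 4 + 9 + 25 = 38

print(sum_x, square_of_sum, sum_of_squares)
```

Here the square of the sum is 100 while the sum of squares is 38, so the order in which summing and squaring are applied matters.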