LU1 Lecture Notes

This document serves as an introduction to statistics, covering fundamental concepts such as population, sample, parameter, and statistic, as well as the distinction between descriptive and inferential statistics. It also discusses data types, measurement scales, and the importance of understanding statistical terminology for effective data analysis. Key components include definitions of random variables, data formats, and sigma notation.

Uploaded by

sadpost787
Copyright
© All Rights Reserved

LEARNING UNIT 1: Introduction to Statistics

Learning objectives
• Understand the concepts of a population, sample, parameter, statistic, random variable
and data
• Distinguish between descriptive and inferential statistics
• Identify data types and measurement scales
• Know the difference between raw data and frequency data
• Understand sigma notation

Textbook reference
• Chapter 1
o §1.1 – §1.3, §1.7
o Exclude §1.4 – §1.6, §1.8 – §1.11
ATE01A1 – LU 1 1
INTRODUCTION

An essential part of the scientific research process is gathering, ordering, and analysing
information from which conclusions can be drawn and interpretations can be made. The
study of statistical methods focuses on how the data should be analysed so that meaningful
conclusions can be drawn.

THE LANGUAGE AND COMPONENTS OF STATISTICS

To perform statistical analyses, one must first understand the language of statistics. In this
section, we will define basic statistical terminology and concepts.

Population

Population refers to the entire collection of individuals, objects, or items under
consideration. A population may be finite or infinite. For example, the shoes manufactured
on any given day in a factory are a finite population. However, all the outcomes when
flipping a coin repeatedly (and indefinitely) would be considered an infinite population. The
total number of elements in a population is denoted by N.

Parameter

A population parameter is a constant value (usually unknown) that describes some
measurable aspect of a population. Population parameters are generally denoted using
Greek letters.

Sample
A sample is a subset of the population of interest. Samples are generally used to collect
information since considering the entire population is not always possible or feasible. The
total number of elements in a sample is denoted by n.

Statistic
A number calculated from sample data, which describes a measurable aspect of a sample,
is called a statistic. Sample statistics are generally denoted using Roman letters.
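
The distinction between a parameter and a statistic can be illustrated with a small sketch. The population below (ten shoe sizes) and the sample size of 4 are hypothetical values chosen only for illustration; they do not come from the notes.

```python
import random
import statistics

# Hypothetical finite population: shoe sizes of N = 10 pairs made in one day.
population = [6, 7, 7, 8, 8, 8, 9, 9, 10, 11]
N = len(population)

# Parameter: computed from every element of the population, so it is a constant.
mu = statistics.mean(population)

# Sample: a random subset of n = 4 elements (the seed only makes the run repeatable).
rng = random.Random(1)
sample = rng.sample(population, 4)
n = len(sample)

# Statistic: computed from the sample only, so it changes from sample to sample.
x_bar = statistics.mean(sample)

print(f"N = {N}, population mean (parameter) = {mu}")
print(f"n = {n}, sample = {sample}, sample mean (statistic) = {x_bar}")
```

Changing the seed gives a different sample and usually a different sample mean, while the population mean stays the same.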

Sampling unit
A sampling unit is an object being measured, counted, or observed.

Random variable
A variable is a characteristic of the elements of a population (or sample) for which the
observed values differ from element to element.
• In probability theory, where a variable assumes specific values with certain associated
probabilities, the variable is called a random variable.
• Variables are denoted by capital letters, e.g. X, Y, Z, and the values assumed by the
random variables are denoted by lowercase letters, e.g. x, y, z.
For example, let X = the height of boys in metres. Here, X is a random variable that
measures the variable “height”. If three boys are selected at random, i.e. n = 3, and their
respective heights are 1.40 m, 1.37 m and 1.41 m, the realisations of the random variable X
are given by $x_i$ for i = 1, 2, 3:
$x_1 = 1.40$, $x_2 = 1.37$, $x_3 = 1.41$

Data
Data are the actual values (numbers) or outcomes recorded for all the variables that were
measured.

Descriptive statistics
Descriptive statistics comprise those methods used to organise and describe information
that has been collected in a sample.

Inferential statistics
Inferential statistics comprise those methods and techniques used for making
generalisations, predictions or estimates about the population using sampled data.

Notation

                     Sample statistic      Population parameter
Mean                 x̄ (x-bar)             μ (mu)
Variance             s² (s-squared)        σ² (sigma-squared)
Standard deviation   s                     σ (sigma)
Proportion           p = x/n               π (pi) [this is not the constant 3.1416]
Size                 n                     N
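
As a rough illustration of the sample-statistic column, the sketch below computes each quantity for the three heights from the random variable example. Two things are assumptions rather than part of the notes: the formula for the sample variance has not been defined at this point, so the code relies on Python's `statistics.variance`, which divides by n − 1; and the attribute used for the proportion ("taller than 1.38 m") is made up for illustration.

```python
import statistics

# The three realisations x_1, x_2, x_3 from the height example.
heights = [1.40, 1.37, 1.41]

n = len(heights)                          # sample size n
x_bar = statistics.mean(heights)          # sample mean (x-bar)
s_squared = statistics.variance(heights)  # sample variance (divides by n - 1)
s = statistics.stdev(heights)             # sample standard deviation

# Proportion p = x / n, where x counts the sampling units that have an attribute.
# The attribute "taller than 1.38 m" is a hypothetical choice.
x = sum(1 for h in heights if h > 1.38)
p = x / n

print(f"n = {n}")
print(f"x-bar = {x_bar:.4f}")
print(f"s^2 = {s_squared:.6f}")
print(f"s = {s:.4f}")
print(f"p = {x}/{n} = {p:.4f}")
```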

Exercise 1.1
Consider the results of the three semester tests for ATE A:

1) All the ATE A students form the:
2) A selection of 50 ATE A students is a:
3) Each test is a:
4) The sampling unit is:
5) The results from all three tests form the:
6) The average mark for Test 1 is a:

7) Testing whether the current group of ATE students performs better than groups from
previous years is the process of:

UNDERSTANDING DATA

Data Types

Determining the most appropriate statistical method depends firstly on the problem
statement to be addressed and secondly on the type of data available. Specific statistical
methods are valid for certain data types only. Data types are identified by the nature of their
random variables. A random variable is categorical (qualitative) or numeric (quantitative).

1. Categorical random variables

Categorical variables are also known as qualitative variables. Such variables allow for
classification based on some characteristic. The variable ‘Eye colour’ can be classified as
brown, blue, green or grey.

The values of categorical variables are often recorded as numerical values. For example,
gender might be coded in the dataset as 1 = Male and 2 = Female, but these values have
no numerical meaning as they denote labels or categories of the variable. Such categorical
data can, therefore, only be counted to determine how many responses belong to each
category.
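
A minimal sketch of this point, using the coding from the text (1 = Male, 2 = Female) on a short, hypothetical list of responses: counting the codes gives a meaningful summary, whereas arithmetic on the codes does not.

```python
from collections import Counter

# Hypothetical coded responses; the coding 1 = Male, 2 = Female is from the text.
codes = [1, 2, 2, 1, 1, 2, 1]
labels = {1: "Male", 2: "Female"}

# Meaningful: count how many responses fall into each category.
counts = Counter(labels[code] for code in codes)
for category, frequency in counts.items():
    print(f"{category}: {frequency}")

# Not meaningful: the "average gender" depends entirely on the arbitrary codes.
arbitrary_average = sum(codes) / len(codes)
print(f"Average of the codes (no interpretation): {arbitrary_average:.2f}")
```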

2. Numerical random variables

Numerical variables are also known as quantitative variables. Such variables are naturally
measured as numbers, for example, a person’s height in centimetres. Arithmetic
operations can be performed on the values as they have numerical meaning.

Numerical variables are further classified as either discrete or continuous:

• Discrete variables assume values obtained by counting (whole numbers or integers)
and consist of a countable number of distinct values. For example, the number of
students in a class (75).

• Continuous variables assume values obtained by measuring and can take any of the
infinitely many values in an interval of the real line. For example, the travel time to
work (28.4 min).

Measurement Scales

Data can also be classified in terms of its scale of measurement, i.e., the procedure used to
measure or obtain the data. There are four types of measurement scales: nominal, ordinal,
interval, and ratio.

1. Nominal

A categorical variable is measured on a nominal scale if the variable consists of two or
more categories with no intrinsic order (of equal importance).

For example, a person’s eye colour could be brown, blue, green or grey. There is no logical
way in which these four categories can be ordered. Nominal data is, therefore, usually
ordered alphabetically and then assigned a numeric value.

2. Ordinal

A categorical variable is measured on an ordinal scale if the variable consists of two or
more categories that can be ordered or ranked.

For example, a person’s age is classified as 1 = young, 2 = middle-aged, 3 = old. The three
possible values of this variable are ordered logically.

Note: in this example, numbers are used to reflect the measurement in order from low to
high, without any numeric meaning attached to the values.

3. Interval

A numerical variable (discrete or continuous) is measured on an interval scale if the values
of the variable can be arranged in order. Furthermore:

• there is no true or absolute zero, i.e., the value of zero is an arbitrary reference point

• differences between data values are meaningful

• ratios between values are not meaningful.

For example, temperature in degrees Celsius. The values are numerical and ordered. A
temperature of 0°C does not mean an absence of temperature, i.e., the scale has an
arbitrary zero value. The difference between 10°C and 20°C is the same as the difference
between 30°C and 40°C, namely a 10-degree difference. However, 20°C is not twice as hot
as 10°C, i.e., ratios are not meaningful.
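
One way to see why ratios are not meaningful on an interval scale is to express the same two temperatures on scales with a different zero point. The sketch below uses the standard conversions to Fahrenheit and Kelvin, which are not part of the notes and are brought in only for illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32


def celsius_to_kelvin(celsius):
    """Standard conversion: K = C + 273.15."""
    return celsius + 273.15


low_c, high_c = 10.0, 20.0

# Differences keep their meaning: a 10-degree gap in Celsius is a 10-kelvin gap.
difference_c = high_c - low_c
difference_k = celsius_to_kelvin(high_c) - celsius_to_kelvin(low_c)

# Ratios do not: the "twice as hot" ratio changes with the arbitrary zero point.
ratio_c = high_c / low_c
ratio_f = celsius_to_fahrenheit(high_c) / celsius_to_fahrenheit(low_c)
ratio_k = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

print(f"Difference: {difference_c:.2f} °C, {difference_k:.2f} K")
print(f"Ratio in Celsius:    {ratio_c:.3f}")
print(f"Ratio in Fahrenheit: {ratio_f:.3f}")
print(f"Ratio in Kelvin:     {ratio_k:.3f}")
```

The same pair of temperatures gives three different ratios, so the statement "20°C is twice as hot as 10°C" has no physical meaning.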

4. Ratio

A numerical variable (discrete or continuous) is measured on a ratio scale if the values of
the variable can be arranged in order. Furthermore:

• there is a true or absolute zero

• differences between data values are meaningful

• ratios between values are meaningful.

For example, the amount of money in a bank account in Rand. The values are numerical
and ordered. An amount of R0 implies an absence of money, i.e., the scale has an absolute
zero value. The difference between R10 and R20 is the same as that between R30 and
R40, namely a R10 difference. R20 is twice as much money as R10, i.e., ratios are
meaningful.
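
By contrast, a ratio survives a change of unit when the scale has a true zero. The short sketch below re-expresses the two amounts from the example in cents (100 cents per Rand) and shows that the ratio is unchanged; the variable names are illustrative only.

```python
CENTS_PER_RAND = 100

amount_a_rand, amount_b_rand = 10, 20

# Re-express the same amounts in a different unit with the same zero point.
amount_a_cents = amount_a_rand * CENTS_PER_RAND
amount_b_cents = amount_b_rand * CENTS_PER_RAND

ratio_in_rand = amount_b_rand / amount_a_rand
ratio_in_cents = amount_b_cents / amount_a_cents

print(f"Ratio in Rand:  {ratio_in_rand}")
print(f"Ratio in cents: {ratio_in_cents}")
```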

Exercise 1.2
Data were collected from a random sample of 20 coffee consumers. The survey yielded the
following variables and data.
Consumer ID number  Gender  Age  Highest qualification  Household size  Daily coffee consumption  Coffee type preference  Choice of brand rating  Coffee affinity score
1                   Male    24   Tertiary certificate   4               3                         Instant                 2                       2.3
2                   Male    26   Degree/Diploma         2               1                         Instant                 1                       1.9
3                   Female  25   Degree/Diploma         3               2                         Filter                  1                       0.8
4                   Female  30   Less than matric       5               7                         Instant                 5                       4.4
5                   Male    35   Tertiary certificate   1               4                         Instant                 3                       3.1
6                   Male    21   Tertiary certificate   1               1                         Filter                  3                       0.4
7                   Male    24   Degree/Diploma         4               2                         Instant                 4                       1.8
8                   Male    19   Matric                 1               1                         Filter                  4                       0.4
9                   Female  28   Postgraduate degree    2               3                         Instant                 2                       3.1
10                  Female  34   Matric                 3               2                         Instant                 1                       1.9
11                  Male    37   Tertiary certificate   2               5                         Instant                 1                       4.9
12                  Female  40   Postgraduate degree    5               2                         Filter                  3                       0.6
13                  Male    29   Degree/Diploma         4               1                         Instant                 1                       0.1
14                  Male    35   Degree/Diploma         2               4                         Filter                  5                       3.6
15                  Female  29   Matric                 3               1                         Filter                  4                       1.0
16                  Male    19   Matric                 6               2                         Instant                 4                       1.4
17                  Female  32   Degree/Diploma         1               3                         Filter                  3                       2.4
18                  Male    19   Less than matric       2               5                         Instant                 2                       3.4
19                  Female  26   Tertiary certificate   5               2                         Instant                 3                       0.2
20                  Female  36   Postgraduate degree    3               8                         Instant                 2                       4.6
Daily coffee consumption = Number of cups
Choice of brand rating: 1 = Not important, 2 = Somewhat important, 3 = Important, 4 = Relatively important, 5 = Very important.
Coffee affinity score: derived from other information (one or more variables were combined) to calculate this new variable.

For each variable, identify the type and the scale of measure.

Variable Type Scale of measure


Consumer ID number
Gender
Age
Highest qualification
Household size
Daily coffee consumption
Coffee type preference
Choice of brand rating
Coffee affinity score

Data Formats

1. Raw data

Raw data refers to unprocessed information, also known as source data or primary data. All
information collected is first represented in raw data format, i.e., the dataset. As in the
previous example, the dataset is shown in a matrix format with rows and columns.
Variables are given in the columns, and observations are presented in the rows. A sample
of n observations and p variables will yield a dataset with n rows and p columns.
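
The row-and-column layout can be sketched in code as a list of rows, using the first three consumers from the coffee survey in Exercise 1.2. Representing the dataset as a plain Python list of lists is an illustrative choice, not something the notes prescribe.

```python
# Column names (variables) from the coffee survey in Exercise 1.2.
variables = [
    "Consumer ID number", "Gender", "Age", "Highest qualification",
    "Household size", "Daily coffee consumption", "Coffee type preference",
    "Choice of brand rating", "Coffee affinity score",
]

# Each row is one observation (one consumer); only the first three are shown.
dataset = [
    [1, "Male", 24, "Tertiary certificate", 4, 3, "Instant", 2, 2.3],
    [2, "Male", 26, "Degree/Diploma", 2, 1, "Instant", 1, 1.9],
    [3, "Female", 25, "Degree/Diploma", 3, 2, "Filter", 1, 0.8],
]

n = len(dataset)    # number of rows = number of observations
p = len(variables)  # number of columns = number of variables

# Every row must supply one value per variable.
assert all(len(row) == p for row in dataset)

# Reading down a column gives all the observed values of one variable.
age_column = variables.index("Age")
ages = [row[age_column] for row in dataset]

print(f"n = {n} rows, p = {p} columns")
print(f"Ages: {ages}")
```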

The steps to enter raw data into the calculator are as follows:

1) SETUP → down arrow → 3:STAT → 2:OFF

2) MODE → 2:STAT → 1:1–VAR

3) Enter variable values in the column labelled X

4) AC

2. Frequency data

Frequency data are raw data in an aggregated format, where individual data values or ranges
of values are listed with a count of the number of times each value or range appears in the
dataset. This count is the frequency of occurrence or simply the frequency. It shows how
the data are distributed across the scale. Frequency data provide an overview of the
sampled information. Univariate frequency data represent counts of a single variable, and
bivariate frequency data represent counts of the combination of two variables. Steps to
enter frequency data into the calculator are given in Learning Unit 2.
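
As a sketch of how raw data become frequency data, the code below aggregates two columns of the coffee survey in Exercise 1.2 (Gender and Coffee type preference, listed in consumer order 1 to 20). Using `collections.Counter` is an illustrative choice; the calculator procedure itself is covered in Learning Unit 2.

```python
from collections import Counter

# Raw data: two columns of the coffee survey, in consumer order 1 to 20.
gender = [
    "Male", "Male", "Female", "Female", "Male", "Male", "Male", "Male",
    "Female", "Female", "Male", "Female", "Male", "Male", "Female", "Male",
    "Female", "Male", "Female", "Female",
]
coffee_type = [
    "Instant", "Instant", "Filter", "Instant", "Instant", "Filter", "Instant",
    "Filter", "Instant", "Instant", "Instant", "Filter", "Instant", "Filter",
    "Filter", "Instant", "Filter", "Instant", "Instant", "Instant",
]

# Univariate frequency data: counts for a single variable.
type_frequency = Counter(coffee_type)

# Bivariate frequency data: counts for each combination of two variables.
gender_by_type = Counter(zip(gender, coffee_type))

print("Coffee type preference")
for value, frequency in sorted(type_frequency.items()):
    print(f"  {value}: {frequency}")

print("Gender by coffee type preference")
for (g, t), frequency in sorted(gender_by_type.items()):
    print(f"  {g}, {t}: {frequency}")
```

The frequencies in each table add up to the sample size, n = 20.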

Sigma Notation (self-read)

In mathematics, sigma notation is the standard notation used to represent summation. It is
a convenient and simple way to write long sums in a compact form. It is denoted by the
Greek capital letter sigma ($\Sigma$). If a random variable X consists of n observations
$x_1, x_2, \ldots, x_n$, the sum of all n values is represented in sigma notation as
$\sum_{i=1}^{n} x_i$, or simply as $\sum x$.

For example, if X = the number of children in a household where $x_1 = 2$, $x_2 = 3$ and
$x_3 = 5$, then the total number of children in all three households in the sample is:

$\sum_{i=1}^{3} x_i = \sum x = x_1 + x_2 + x_3 = 2 + 3 + 5 = 10$

Note: the square of the sum is not equal to the sum of squares, i.e.
$\left(\sum x\right)^2 \neq \sum (x^2)$

Example: $\left(\sum x\right)^2 = (x_1 + x_2 + x_3)^2 = (2 + 3 + 5)^2 = (10)^2 = 100$

$\sum (x^2) = x_1^2 + x_2^2 + x_3^2 = 2^2 + 3^2 + 5^2 = 38$
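
The same sums can be checked with a few lines of code. This is only a sketch of the arithmetic in the example above; the variable names are illustrative.

```python
# Number of children in the three sampled households: x_1, x_2, x_3.
x = [2, 3, 5]

# Sum of the observations.
total = sum(x)

# Square of the sum: add first, then square.
square_of_sum = sum(x) ** 2

# Sum of squares: square each value first, then add.
sum_of_squares = sum(value ** 2 for value in x)

print(f"Sum of x:       {total}")
print(f"Square of sum:  {square_of_sum}")
print(f"Sum of squares: {sum_of_squares}")
```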
