Chapter 1
Definitions:
– _______________: A complete collection of all objects to be studied.
– _______________: the collection of data from every element in the population.
– _______________: a subset of the population.
EX) Suppose we are interested in the average GPA of all Camosun College students, then:
If we have all the Camosun College students’ GPAs then we have __________________
Definitions cont:
– ________________: A numerical aspect of a ________________.
EX) The mean age of all Canadians, µ = 25.4
– ________________: any characteristic whose value may change from one object to another in
the population.
EX) X = height of a randomly selected student: 135 cm, 162 cm, ...
Y = hair colour of a randomly selected student: black, brunette, blonde, ...
– Data and Observations:
– ________________ data consists of observations on a single variable.
EX) height of STAT 218 students: 135cm,162cm...
– ________________ data consists of observations on each of two variables.
EX) height and weight of STAT 218 students: (135 cm, 50 kg), (162 cm, 63 kg), ...
– ________________ data consists of observations on each of more than two variables.
No, the teacher’s pay increase did not cause beer prices to go up. This observational
study did not control for other variables, and it has a “lurking variable” – inflation, which
caused both the beer price and the teacher’s salary to go up.
– Methods of Sampling:
– Random: Individuals are randomly selected from the population. Selections are made so that
each has an equal chance of being selected.
– Systematic: Randomly select some starting point and then select every kth element in the
population.
– Stratified: Subdivide the population into subgroups that share the same characteristic (e.g.
gender or age group), then draw a random sample from each subgroup (stratum).
– Cluster: Divide the population into sections (or clusters); each cluster contains individuals with
different characteristics; randomly select some of those clusters; choose all members from
selected clusters.
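The four sampling methods above can be sketched in Python. This is a minimal illustration, not a prescribed procedure: the function names, the population of student IDs 1 to 100, the even/odd strata, and the clusters of 20 are all hypothetical choices made for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Random: every individual has an equal chance of being selected."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Systematic: random starting point, then every kth element."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Stratified: group by a shared characteristic, sample within each group."""
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Cluster: randomly pick whole clusters and keep all of their members."""
    chosen = rng.sample(clusters, n_clusters)
    return [item for cluster in chosen for item in cluster]

rng = random.Random(0)                      # fixed seed so the run is repeatable
students = list(range(1, 101))              # hypothetical student IDs
srs = simple_random_sample(students, 10, rng)
sys_sample = systematic_sample(students, 10, rng)
strat = stratified_sample(students, lambda s: s % 2, 5, rng)  # strata: odd/even ID
clusters = [students[i:i + 20] for i in range(0, 100, 20)]    # five clusters of 20
clus = cluster_sample(clusters, 2, rng)
```

Note the difference in the last two: a stratified sample takes a few individuals from every stratum, while a cluster sample takes every individual from a few clusters.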
Distribution Shapes:
DATA
    CATEGORICAL
    NUMERICAL
        DISCRETE
        CONTINUOUS
– A variable is ________________ if its set of possible values constitutes a finite set or an infinite
sequence. Usually results from counting.
EX) class size
We will describe this (numerical) data set using several graphing techniques.
1. Histogram
Frequency Table
Guidelines for constructing frequency tables and histograms.
(1) Number of intervals: $K \approx \sqrt{\text{number of observations}}$
we choose $K = 6$ since $n = 34$
(2) Starting point ≤ smallest data value. Choose 20 < 21
(3) Class width: $W > \dfrac{\max - \text{starting point}}{\text{number of intervals}} = \dfrac{80 - 20}{6} = 10$
2 | 1466678
3 | 001133444555778
4 | 111249
5 | 0
6 | 011
7 | 4
8 | 0
– Sorts data
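The frequency-table guidelines can be checked against the data read off the stem-and-leaf display. This is a minimal sketch under stated assumptions: the function name `frequency_table`, the left-closed class convention [lower, upper), and the width W = 11 (any value above 10 satisfies guideline 3) are illustrative choices, not values taken from the notes.

```python
import math

# Data read off the stem-and-leaf display (stem = tens digit, leaf = ones digit).
stems = {2: "1466678", 3: "001133444555778", 4: "111249",
         5: "0", 6: "011", 7: "4", 8: "0"}
data = [10 * stem + int(leaf) for stem, leaves in stems.items() for leaf in leaves]

def frequency_table(values, start, width):
    """Count values in left-closed classes [lower, lower + width)."""
    table = []
    lower = start
    while lower <= max(values):
        upper = lower + width
        count = sum(lower <= v < upper for v in values)
        table.append((lower, upper, count))
        lower = upper
    return table

n = len(data)
k = round(math.sqrt(n))               # guideline (1): K is about sqrt(n)
start = 20                            # guideline (2): at or below the minimum (21)
min_width = (max(data) - start) / k   # guideline (3): W must exceed this value
width = 11                            # assumed choice satisfying W > min_width

for lower, upper, count in frequency_table(data, start, width):
    print(f"{lower} to <{upper}: {count}")
```

The strict inequality in guideline (3) matters here: with W exactly 10 and left-closed classes, the largest value would fall on the upper boundary of the sixth class and force a seventh.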
Center of a Dataset
Sample Mean: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Population Mean: $\mu = \dfrac{x_1 + x_2 + \cdots + x_N}{N} = \dfrac{\sum_{i=1}^{N} x_i}{N}$
EX) Suppose we have data from two samples. Find the mean of each.
Sample x: 5, 7, 9
If we take a sample from this population and get sample = {1, 4, 5}, find the mean:
Note: Usually the sample mean is NOT equal to the population mean but 𝑥̅ can be used to estimate µ.
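As a quick check of the mean formula, the following Python sketch computes the mean of Sample x (5, 7, 9) and of the sample {1, 4, 5}. The helper name `mean` is just an illustrative choice.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

sample_x = [5, 7, 9]
sample = [1, 4, 5]

x_bar = mean(sample_x)       # (5 + 7 + 9) / 3
sample_mean = mean(sample)   # (1 + 4 + 5) / 3

print(x_bar, round(sample_mean, 2))
```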
* if n is odd, the median is the middle or $\left(\frac{n+1}{2}\right)^{\text{th}}$ value
* if n is even, the median is the average of the two middle values: that is, the average of the
$\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n}{2}+1\right)^{\text{th}}$ values
EX) Suppose we have data on two samples. Find the median of each.
Sample x: 2, 2, 3, 7, 8
Note: Usually the sample median is NOT equal to the population median but 𝑥̃ can be used to estimate µ̃.
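The odd/even rule for the median can be written as a short Python function. This is a sketch for illustration; the even-sized list used in the second call is a hypothetical example, not data from the notes.

```python
def median(values):
    """Middle value if n is odd; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n + 1) / 2)th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # (n/2)th and (n/2 + 1)th values

sample_x = [2, 2, 3, 7, 8]
median_x = median(sample_x)            # n = 5 is odd: the 3rd ordered value
median_even = median([1, 2, 3, 4])     # hypothetical even-sized sample

print(median_x, median_even)
```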
Mean Vs Median
x: 4, 7, 11, 17, 22
The trimmed mean is a compromise between the mean and the median
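To see the compromise concretely, the sketch below compares the mean, the median, and a trimmed mean for x: 4, 7, 11, 17, 22. The function name `trimmed_mean` and the choice to drop one observation from each end (a 20% trim when n = 5) are assumptions made for illustration.

```python
def trimmed_mean(values, k):
    """Drop the k smallest and k largest observations, then average the rest."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

x = [4, 7, 11, 17, 22]

mean_x = sum(x) / len(x)       # uses every observation
median_x = sorted(x)[2]        # middle of five ordered values
trimmed_x = trimmed_mean(x, 1) # drops 4 and 22, averages 7, 11, 17

print(mean_x, median_x, round(trimmed_x, 2))
```

Trimming nothing (k = 0) gives back the ordinary mean, and trimming as much as possible leaves only the middle value or values, which is the median.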
EX) Given the following sample, x: 20, 60, 67, 70, 99. Find the
d) Mode
EX) Suppose we have the following data sets. Find the mode of each
a) 2, 2, 3, 5
b) 2, 2, 3, 5, 5
c) 2, 2, 3, 3, 5, 7, 7
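The three data sets above can be checked with a small Python helper that returns every value tied for the highest frequency. The function name `modes` is a hypothetical choice; it reports all tied values rather than a single one.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

modes_a = modes([2, 2, 3, 5])
modes_b = modes([2, 2, 3, 5, 5])
modes_c = modes([2, 2, 3, 3, 5, 7, 7])

print(modes_a, modes_b, modes_c)
```

A data set can therefore have one mode, two modes (bimodal), or more.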
Location of a Data Point in Ordered Datasets
EX) Your test score is 85, where do you stand in the class?
*Note: Include the middle value into both the lower and upper groups to find Q1 and Q3 if n is odd
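The note above translates into the following Python sketch for Q1, Q2 and Q3. The function names are illustrative, and the data used are hypothetical small samples; other textbooks and software use slightly different quartile conventions, so results may differ from a calculator or spreadsheet.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 are the medians of the lower and upper halves.

    When n is odd, the middle value is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2              # half size, counting the middle value if n is odd
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median(lower), median(ordered), median(upper)

q_odd = quartiles([2, 2, 3, 7, 8])   # halves: 2, 2, 3 and 3, 7, 8
q_even = quartiles([1, 2, 3, 4])     # halves: 1, 2 and 3, 4

print(q_odd, q_even)
```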
EX) Consider a survey of n people on the question of “Do you agree with marijuana legalization?”
There are 3 categories for response: Yes No No Opinion
Number in each category: x1 x2 x3
Let p1 = the true proportion of all Canadians who support legalization then 𝑝̂1 can be used to estimate p1
If we think of H=1 and T=0 then we could find the proportion of heads by adding up all the 1’s.
We can think of proportion as a special type of mean. In fact, they share many properties.
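The coding trick with H = 1 and T = 0 can be shown in a few lines of Python. The sequence of tosses is hypothetical, made up only to illustrate that the sample proportion is the mean of the 0/1 values.

```python
tosses = ["H", "T", "H", "H", "T"]             # hypothetical coin tosses
coded = [1 if t == "H" else 0 for t in tosses]  # H = 1, T = 0

heads = sum(coded)              # adding up the 1's counts the heads
p_hat = heads / len(coded)      # sample proportion = mean of the coded values

print(heads, p_hat)
```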
1.4 – Measures of Variability
Both distributions are bell-shaped, have the same center but have very different spreads: variability
matters!!
EX) Consider two samples. Find the mean and median of each.
Range
The range, R, is the difference between the largest and smallest data values: R = max – min
For population: x1, x2, …, xN The population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
For sample: x1, x2, …, xn The sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
- n – 1 is used instead of n because the xi’s tend to be closer to their own average 𝑥̅ than to the
population average µ; dividing by the smaller number compensates for this.
a) x: 2, 5, 4, 7, 9
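Assuming part a) asks for the variance and standard deviation of x, the variance formulas can be worked through in Python as follows. The function names are illustrative; the population version is included only to contrast the divisor N with n - 1.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((xi - x_bar) ** 2 for xi in values) / (n - 1)

def population_variance(values):
    """Same sum of squared deviations, divided by N (data treated as a population)."""
    n = len(values)
    mu = sum(values) / n
    return sum((xi - mu) ** 2 for xi in values) / n

x = [2, 5, 4, 7, 9]
s2 = sample_variance(x)     # deviations from the mean 5.4, squared, summed, over 4
s = math.sqrt(s2)           # sample standard deviation

print(round(s2, 2), round(s, 3))
```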
$\bar{y} = a + b\bar{x}$
$s_y^2 = b^2 s_x^2$
$s_y = |b| s_x$
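These three identities for y = a + bx can be verified numerically. In the sketch below, the constants a = 3 and b = -2 are hypothetical, and a negative b is chosen on purpose to show why the standard deviation uses |b|.

```python
import math
import statistics

x = [2, 5, 4, 7, 9]
a, b = 3, -2                       # hypothetical constants for y = a + b*x
y = [a + b * xi for xi in x]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
var_x, var_y = statistics.variance(x), statistics.variance(y)  # divisor n - 1
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)

print(math.isclose(y_bar, a + b * x_bar))   # mean shifts and scales
print(math.isclose(var_y, b ** 2 * var_x))  # variance ignores a, scales by b squared
print(math.isclose(sd_y, abs(b) * sd_x))    # standard deviation scales by |b|
```

Adding the constant a moves every observation by the same amount, so it changes the center but not the spread.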
Empirical Rule: If X has bell-shaped distribution with mean = µ and standard deviation = σ then,
Therefore, for bell-shaped distributions: $6\sigma \approx \text{range} \rightarrow \sigma \approx \dfrac{\text{range}}{6}$
EX) Suppose that adult IQ scores follow a bell-shaped distribution with mean µ=100 and standard
deviation σ=15. What values do 95% of all such scores fall between?
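The IQ question can be answered with a short Python sketch. It assumes the usual statement of the Empirical Rule (roughly 68%, 95% and 99.7% of values within 1, 2 and 3 standard deviations of the mean); the function name `empirical_interval` is an illustrative choice.

```python
def empirical_interval(mu, sigma, k):
    """Interval from k standard deviations below the mean to k above it."""
    return mu - k * sigma, mu + k * sigma

# Assumed Empirical Rule coverage for a bell-shaped distribution.
coverage = {1: "about 68%", 2: "about 95%", 3: "about 99.7%"}

mu, sigma = 100, 15   # adult IQ scores from the example
for k, share in coverage.items():
    low, high = empirical_interval(mu, sigma, k)
    print(f"{share} of scores fall between {low} and {high}")

iq_95 = empirical_interval(mu, sigma, 2)
```

The 3-standard-deviation interval spans 6σ and covers nearly all of the data, which is where the approximation σ ≈ range/6 comes from.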
Fourth Spread, fs
fs = Q3 – Q1
The fourth spread is NOT sensitive to outliers. In fact, we use it to identify outliers!
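One way to use the fourth spread to flag outliers is sketched below in Python. The cutoffs are an assumption, since the notes do not state them here: a common convention labels an observation an outlier if it lies more than 1.5 fs beyond the nearest fourth, and extreme if more than 3 fs. The data are the 34 values from the stem-and-leaf display earlier in the chapter, and the function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def fourths(values):
    """Q1 and Q3 as medians of the lower and upper halves (middle value in both if n is odd)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2
    return median(ordered[:half]), median(ordered[n - half:])

def find_outliers(values, multiplier=1.5):
    """Values more than multiplier * fs below Q1 or above Q3 (assumed convention)."""
    q1, q3 = fourths(values)
    fs = q3 - q1
    low, high = q1 - multiplier * fs, q3 + multiplier * fs
    return [v for v in sorted(values) if v < low or v > high]

data = [21, 24, 26, 26, 26, 27, 28,
        30, 30, 31, 31, 33, 33, 34, 34, 34, 35, 35, 35, 37, 37, 38,
        41, 41, 41, 42, 44, 49, 50, 60, 61, 61, 74, 80]

q1, q3 = fourths(data)
fs = q3 - q1
outliers = find_outliers(data)              # beyond 1.5 fs
extreme = find_outliers(data, multiplier=3) # beyond 3 fs

print(q1, q3, fs, outliers, extreme)
```

Because Q1 and Q3 depend only on the middle of the ordered data, making the largest value even larger would not change fs or the cutoffs.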
Outliers
Boxplot
Making Boxplots
* A distribution is symmetric if both the box and the whiskers are symmetric about Q2
Q) A sample of 20 glass bottles of a particular type was selected and the internal pressure strength of each
bottle was determined. Consider the following partial sample information:
b) Construct a boxplot that shows outliers, and comment on any interesting features.