Chapter 1

Chapter 1 provides an overview of populations, samples, and processes in statistics, defining key terms and methods of sampling. It discusses descriptive statistics, including measures of location and variability, as well as methods for visualizing data. The chapter emphasizes the importance of understanding data types and distributions to effectively analyze and interpret statistical information.

Uploaded by

mazen
Copyright
© All Rights Reserved
Chapter 1: Overview and Descriptive Statistics

1.1 – Populations, Samples and Processes

• Definitions:
– _______________: A complete collection of all objects to be studied.
– _______________: the collection of data from every element in the population.
– _______________: a subset of the population.

EX) Suppose we are interested in the average GPA of all Camosun College students, then:

The relevant population is _________________________________________________

A sample can be _________________________________________________________

If we have all the Camosun College students’ GPAs then we have __________________

• Definitions (cont.):
– ________________: A numerical aspect of a ________________.
EX) The mean age of all Canadians, µ = 25.4

– ________________: A numerical aspect of a ________________.
EX) The average number of people in 20 randomly selected housing units, x̄ = 3.75.

Note: A __________________ can be used to estimate a _________________.

– ________________: any characteristic whose value may change from one object to another in
the population.
EX) X = height of a randomly selected student: 135cm,162cm...
Y = hair colour of a randomly selected student: black , brunette, blonde,...
– Data and Observations:
– ________________ data consists of observations on a single variable.
EX) height of STAT 218 students: 135cm,162cm...
– ________________ data consists of observations on each of two variables.
EX) height and weight of STAT 218 students: (135cm,50kg ), (162cm, 63kg )...
– ________________ data consists of observations on each of more than two variables.

EX) height, weight and hair colour of STAT 218 students: (135cm, 50kg, blonde), (162cm, 63kg, red)...
– Two Types of Processes:
– ______________________________: A process in which we observe and measure certain
characteristics, but we don’t attempt to manipulate or modify the subjects being studied. That is,
_________________________ are applied to the subjects studied.
Valuable for discovering trends and possible relationships but cannot be used to establish cause
and effect.
EX) One study observed that the higher teachers’ salaries were, the higher beer prices were.
Did the increase in teachers’ pay cause beer prices to go up?

No, the increase in teachers’ pay did not cause beer prices to go up. This observational
study did not control other variables, and it has a “lurking variable”, inflation, which
caused both beer prices and teachers’ salaries to go up.

– ______________________________: A process in which we apply some treatment and then
proceed to observe its effects on the subject.
There is at least one control group (where subjects receive no treatment) in an experiment so that
comparisons can be made and any difference in the outcomes can be attributed to “treatment”.
Experiments can establish cause and effect.
EX) The 1954 American Polio Vaccine experiment followed 200,000 randomly
selected children given the Salk Vaccine and another 200,000 randomly selected
children given a placebo. The difference in the number of polio cases between the
two groups was attributed to the effect of the Salk Vaccine.

– Methods of Sampling:
– Random: Individuals are randomly selected from the population. Selections are made so that
each has an equal chance of being selected.

– Systematic: Randomly select some starting point and then select every kth element in the
population.

– Stratified: Subdivide the population into subgroups that share the same characteristic (e.g.
gender or age group), then draw a random sample from each subgroup (stratum).

– Cluster: Divide the population into sections (or clusters); each cluster contains individuals with
different characteristics; randomly select some of those clusters; choose all members from
selected clusters.

– Convenience: A sample is obtained by selecting individuals or objects without randomization.
For example, asking a MATH 218 class what program of study they are in to obtain information
about the programs of all Camosun students.
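
The four randomized methods above can be sketched in a few lines of Python. This is only an illustration: the 20-person population, the two age groups, the sample sizes and the fixed seed are all made-up assumptions, not part of the course material.

```python
import random

# Hypothetical population: 20 student IDs, each tagged with an age group.
population = [{"id": i, "group": "under 25" if i % 2 == 0 else "25 and over"}
              for i in range(1, 21)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Random: every individual has an equal chance of being selected.
simple = rng.sample(population, 5)

# Systematic: random starting point, then every k-th element.
k = 4
start = rng.randrange(k)
systematic = population[start::k]

# Stratified: split into strata, then draw a random sample within each stratum.
strata = {}
for person in population:
    strata.setdefault(person["group"], []).append(person)
stratified = [p for members in strata.values() for p in rng.sample(members, 2)]

# Cluster: split into sections, randomly pick some sections, keep everyone in them.
clusters = [population[i:i + 5] for i in range(0, 20, 5)]
cluster_sample = [p for chosen in rng.sample(clusters, 2) for p in chosen]
```

A convenience sample has no counterpart here because it involves no randomization at all.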
• Branches of Statistics:
– ________________________ statistics: summary and description of collected data
(Sections 1.2 – 1.4)
– ________________________ statistics: generalizing from a sample to a population (Chapters 6 – 9)

• Relationship between Probability and Statistics:

– To solve a _________________ problem, certain characteristics of a population are assumed to
be known. We then answer questions concerning a sample from that population.

– In a ________________ problem, we assume very little about a population. We use the
information about a sample to answer questions concerning the population.

1.2 – Pictorial and Tabular Methods in Descriptive Statistics

• Three important features to report when describing the distribution of a quantitative variable:


– ______________
– ______________
– ______________

• Distribution Shapes:

[Figure: sketches of distribution shapes: symmetric, unimodal, bimodal, right (positive) skew, left (negative) skew]


• The ____________________ of the distribution (Section 1.3)
EX) mean family size, median housing price

• The ____________________ of the distribution (Section 1.4)
EX) variance, standard deviation, range, fourth spread

[Diagram: DATA divides into CATEGORICAL and NUMERICAL; NUMERICAL divides into DISCRETE and CONTINUOUS]

• Two Types of Numerical Variables

– A variable is ________________ if its set of possible values constitutes a finite set or an infinite
sequence. Usually results from counting.
EX) class size

– A variable is _______________________ if its set of possible values consists of an entire interval
on a number line. Usually results from a measurement.
EX) height

EX) The following are ages of 34 Oscar-Winning Best Actresses:


21 24 26 26 26 27 28 30 30 31
31 33 33 34 34 34 35 35 35 37
37 38 41 41 41 42 44 49 50 60
61 61 74 80

We will describe this (numerical) data set using several graphing techniques.
1. Histogram

Frequency Table
Guidelines for constructing frequency tables and histograms.
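(1) Number of intervals: $K \approx \sqrt{\text{number of observations}}$
We choose $K = 6$ since $n = 34$ and $\sqrt{34} \approx 5.8$.
(2) Starting point ≤ smallest data value. Choose 20, since 20 < 21.
(3) Class width: $W \ge \dfrac{\text{max} - \text{starting point}}{\text{number of intervals}} = \dfrac{80 - 20}{6} = 10$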

Interval   Frequency   Relative    Cumulative   Cumulative Relative
                       Frequency   Frequency    Frequency
(20,30]    9           9/34        9            9/34
(30,40]    13          13/34       22           22/34
(40,50]    7           7/34        29           29/34
(50,60]    1           1/34        30           30/34
(60,70]    2           2/34        32           32/34
(70,80]    2           2/34        34           34/34
Total      34          1.00
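
As a rough check, the three guidelines and the table above can be reproduced with a short Python sketch. The variable names are illustrative, and rounding √n to the nearest integer is an assumption about how "K ≅ √n" is applied.

```python
import math

ages = [21, 24, 26, 26, 26, 27, 28, 30, 30, 31,
        31, 33, 33, 34, 34, 34, 35, 35, 35, 37,
        37, 38, 41, 41, 41, 42, 44, 49, 50, 60,
        61, 61, 74, 80]

n = len(ages)
k = round(math.sqrt(n))                      # guideline (1): K is about sqrt(n)
start = 20                                   # guideline (2): at or below the smallest value
width = math.ceil((max(ages) - start) / k)   # guideline (3): smallest whole width that covers the max

table = []
cumulative = 0
for j in range(k):
    lo, hi = start + j * width, start + (j + 1) * width
    freq = sum(1 for a in ages if lo < a <= hi)   # interval (lo, hi], closed on the right
    cumulative += freq
    table.append((lo, hi, freq, freq / n, cumulative, cumulative / n))

for lo, hi, freq, rel, cum, cum_rel in table:
    print(f"({lo},{hi}]  {freq:2d}  {rel:.3f}  {cum:2d}  {cum_rel:.3f}")
```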
2. Stem-and-leaf Plot
The decimal point is 1 digit(s) to the right of the |

2 | 1466678
3 | 001133444555778
4 | 111249
5 | 0
6 | 011
7 | 4
8 | 0

Features of the Stem-and-leaf plot:

– Displays the shape of the distribution: right-skewed.

– Sorts data

– Shows all data points
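
A stem-and-leaf plot like the one above can be built by splitting each value into a tens digit (stem) and a ones digit (leaf). The following is a minimal sketch for two-digit data only; it is not the statistical software that produced the display in these notes.

```python
from collections import defaultdict

ages = [21, 24, 26, 26, 26, 27, 28, 30, 30, 31,
        31, 33, 33, 34, 34, 34, 35, 35, 35, 37,
        37, 38, 41, 41, 41, 42, 44, 49, 50, 60,
        61, 61, 74, 80]

stems = defaultdict(list)
for a in sorted(ages):
    stems[a // 10].append(a % 10)   # tens digit is the stem, ones digit is the leaf

# One line per stem, including stems that would have no leaves.
lines = [f"{stem} | {''.join(str(leaf) for leaf in stems[stem])}"
         for stem in range(min(stems), max(stems) + 1)]
print("\n".join(lines))
```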

• Histograms of discrete data

37 people were surveyed on the number of credit cards they own. Here are the data:
Number of cards Frequency
0 3
1 11
2 15
3 4
4 2
5 1
6 1
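
For discrete data, each distinct value gets its own bar. A minimal text-only sketch of that idea, using the credit card counts above (the asterisk bars are just a stand-in for a drawn histogram):

```python
# value -> frequency, copied from the table above
cards = {0: 3, 1: 11, 2: 15, 3: 4, 4: 2, 5: 1, 6: 1}
n = sum(cards.values())

# Relative frequency of each value: frequency divided by the number surveyed.
rel_freq = {value: count / n for value, count in cards.items()}

for value, count in cards.items():
    print(f"{value} | {'*' * count}  ({rel_freq[value]:.3f})")
```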
• Graphing Categorical Data:
1.3 – Measures of Location

• Center of a Dataset

a) Mean – is the balance point

Given a sample: x1, x2, …, xn where n = sample size

Sample Mean: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

Given a population: x1, x2, …, xN where N = population size

Population Mean: $\mu = \dfrac{x_1 + x_2 + \cdots + x_N}{N} = \dfrac{\sum_{i=1}^{N} x_i}{N}$

EX) Suppose we have data from two samples. Find the mean of each.

Sample x: 5, 7, 9

Sample y: 50, 70, 90

This example leads us to the following property:

If y = cx, where c is a constant, then $\bar{y} = c\bar{x}$ and/or $\mu_y = c\mu_x$


EX) Suppose we have a population: {1, 2, 3, 4, 5, 6}, find the mean:

If we take a sample from this population and get sample = {1, 4, 5}, find the mean:

Note: Usually the sample mean is NOT equal to the population mean but 𝑥̅ can be used to estimate µ.
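
The mean examples above can be checked with Python's standard `statistics` module. This is only a quick sketch; the variable names are illustrative.

```python
from statistics import mean

x = [5, 7, 9]
y = [10 * v for v in x]     # y = cx with c = 10, giving 50, 70, 90
x_bar, y_bar = mean(x), mean(y)

population = [1, 2, 3, 4, 5, 6]
sample = [1, 4, 5]
mu = mean(population)       # population mean
sample_mean = mean(sample)  # sample mean, an estimate of mu

print(x_bar, y_bar, mu, sample_mean)
```

Here the sample mean differs from the population mean, which is the situation the note above describes.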

b) Median – the middle point of ordered data

For ordered data values x1, x2, …, xn

* if n is odd, the median is the middle, or $\left(\frac{n+1}{2}\right)^{\text{th}}$, value

* if n is even, the median is the average of the two middle values: that is, the average of the
$\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n}{2} + 1\right)^{\text{th}}$ values

Sample median: is always denoted by 𝑥̃

Population median: is always denoted by 𝜇̃

EX) Suppose we have data on two samples. Find the median of each.

Sample x: 2, 2, 3, 7, 8

Sample y: 20, 20, 30, 70, 80


This example leads us to another property:

If y = cx, where c is a constant, then $\tilde{y} = c\tilde{x}$ and/or $\tilde{\mu}_y = c\tilde{\mu}_x$

Note: Usually the sample median is NOT equal to the population median but 𝑥̃ can be used to estimate µ̃.

Mean Vs Median

Q) Find the mean and median of each sample:

x: 4, 7, 11, 17, 22

y: 4, 7, 11, 17, 220

* Median is a better measure of center if there are extreme values.

* Mean is sensitive to extreme values while the median is not
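
A short sketch comparing the two samples from the question above shows how one extreme value moves the mean but leaves the median alone. It uses only the standard `statistics` module.

```python
from statistics import mean, median

x = [4, 7, 11, 17, 22]
y = [4, 7, 11, 17, 220]   # same as x except for one extreme value

summary = {name: (mean(data), median(data))
           for name, data in (("x", x), ("y", y))}

for name, (avg, med) in summary.items():
    print(f"{name}: mean = {avg}, median = {med}")
```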


c) Trimmed Mean

The trimmed mean is a compromise between the mean and the median

EX) Given the following sample, x: 20, 60, 67, 70, 99. Find the

a) 20% trimmed mean, $\bar{x}_{tr(20)}$

b) 25% trimmed mean, $\bar{x}_{tr(25)}$


Q) Find the 30% trimmed mean, $\bar{x}_{tr(30)}$
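
One possible way to compute trimmed means is sketched below. It assumes that when the trimming percentage does not remove a whole number of observations from each end, the result is found by linear interpolation between the two nearest trimmed means that can be computed exactly. That interpolation rule is an assumption of this sketch; the notes do not state which convention the course uses.

```python
import math

def trimmed_mean(data, percent):
    """Trim `percent` % of the observations from EACH end, then average.

    Assumes percent < 50 and that at least one value remains after trimming.
    """
    values = sorted(data)
    n = len(values)
    k_real = n * percent / 100      # observations to drop per end (may be fractional)
    k_lo = math.floor(k_real)

    def mean_after_dropping(k):
        kept = values[k:n - k]
        return sum(kept) / len(kept)

    frac = k_real - k_lo
    if frac == 0:
        return mean_after_dropping(k_lo)
    # Assumed convention: interpolate between dropping k_lo and k_lo + 1 per end.
    return (1 - frac) * mean_after_dropping(k_lo) + frac * mean_after_dropping(k_lo + 1)

x = [20, 60, 67, 70, 99]
for pct in (20, 25, 30):
    print(pct, round(trimmed_mean(x, pct), 3))
```

With n = 5, a 20% trim drops exactly one value from each end, while 25% and 30% fall between the 20% and 40% trimmed means.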

d) Mode

The mode is the value that occurs most often

EX) Suppose we have the following data sets. Find the mode of each

a) 2, 2, 3, 5

b) 2, 2, 3, 5, 5

c) 2, 2, 3, 3, 5, 7, 7
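
A data set can have more than one mode. Python's `statistics.multimode` (available from Python 3.8) returns every value tied for the highest count, which makes it a convenient way to check the three data sets above.

```python
from statistics import multimode

datasets = {
    "a": [2, 2, 3, 5],
    "b": [2, 2, 3, 5, 5],
    "c": [2, 2, 3, 3, 5, 7, 7],
}

# multimode lists all values that share the highest frequency.
modes = {label: multimode(values) for label, values in datasets.items()}
print(modes)
```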
• Location of a Data Point in Ordered Datasets
EX) Your test score is 85, where do you stand in the class?

a) Quartiles (Q1, Q2, Q3) or fourths


They divide the data set into approximately four equal parts

EX) Given the data: 2, 5, 7, 10, 17, 14


Find the 5-Number summary: min Q1 Q2 Q3 max
EX) Given the data: 2, 5, 7, 10, 14
Find the 5-Number summary

*Note: Include the middle value into both the lower and upper groups to find Q1 and Q3 if n is odd
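
The rule in the note above (quartiles as medians of the lower and upper halves, with the middle value placed in both halves when n is odd) can be sketched as follows. Statistical software often uses other quartile conventions, so results from a package may differ slightly from this hand method.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) using the halves rule from these notes."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half + n % 2]   # includes the middle value when n is odd
    upper = values[half:]           # includes the middle value when n is odd
    return (values[0], median(lower), median(values), median(upper), values[-1])

print(five_number_summary([2, 5, 7, 10, 17, 14]))
print(five_number_summary([2, 5, 7, 10, 14]))
```

Note that the data are sorted first, since the first example is not given in order.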

b) Percentiles (p1, p2, …, p100)


The percentiles are numbers that break the ordered data into approximately 100 equal pieces
• Categorical Data and Sample Proportions

EX) Consider a survey of n people on the question “Do you agree with marijuana legalization?”
There are 3 categories of response:

Response:                  Yes    No    No Opinion
Number in each category:   x1     x2    x3

The proportion of people who support legalization:

The proportion of people who are against legalization:

Let p1 = the true proportion of all Canadians who support legalization then 𝑝̂1 can be used to estimate p1

EX) Flip a coin 8 times: H H T H T T H H

If we think of H=1 and T=0 then we could find the proportion of heads by adding up all the 1’s and dividing by the number of flips.

We can think of proportion as a special type of mean. In fact, they share many properties.
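
The coin example can be written out directly: code each flip as 1 or 0, and the sample proportion is simply the mean of those indicators. A minimal sketch:

```python
flips = "HHTHTTHH"                                  # the 8 flips listed above
indicators = [1 if f == "H" else 0 for f in flips]  # H = 1, T = 0

# The proportion of heads is the mean of the 0/1 values.
p_hat = sum(indicators) / len(indicators)
print(p_hat)
```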
1.4 – Measures of Variability

Let’s look at two different distributions:

Both distributions are bell-shaped and have the same center, but have very different spreads: variability matters!

EX) Consider two samples. Find the mean and median of each.

A: 6.5 6.6 6.7 6.8 7.1

B: 4.4 5.1 6.7 7.3 10.2


EX) Suppose we have the following information for waiting times (in minutes) at a bus stop.

6.5 6.8 7.0 7.7 the mean = 7.0

Data   Deviation                          Absolute Deviation                     Squared Deviation
x      $x_i - \bar{x}$ (or $x_i - \mu$)   $|x_i - \bar{x}|$ (or $|x_i - \mu|$)   $(x_i - \bar{x})^2$ (or $(x_i - \mu)^2$)

• Range

The range, R, is the difference between the max data value and the min data value: R = max – min

EX) For the waiting times example above, find R

• MAD (Mean Absolute Deviation)

$\text{MAD} = \dfrac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$

EX) Find MAD for the waiting times example
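
The columns of the deviation table, the range and the MAD for the waiting times can be computed step by step as in the sketch below. Small floating-point rounding errors are expected, so the printed values are rounded.

```python
times = [6.5, 6.8, 7.0, 7.7]   # waiting times in minutes
n = len(times)
x_bar = sum(times) / n

deviations = [x - x_bar for x in times]
abs_deviations = [abs(d) for d in deviations]
sq_deviations = [d ** 2 for d in deviations]

data_range = max(times) - min(times)   # R = max - min
mad = sum(abs_deviations) / n          # mean absolute deviation

for x, d, a, s in zip(times, deviations, abs_deviations, sq_deviations):
    print(f"{x:4.1f}  {d:5.2f}  {a:4.2f}  {s:4.2f}")
print("R =", round(data_range, 2), " MAD =", round(mad, 2))
```

The deviations always sum to zero (up to rounding), which is why the absolute or squared deviations are used to measure spread.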


• Variance

For population: x1, x2, …, xN The population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

For sample: x1, x2, …, xn The sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Why is the sample variance divided by n – 1 instead of n?

- because the xi’s tend to be closer to their average 𝑥̅ than to the population average µ so
n – 1 is used instead of n to compensate.

- if we used n then the sample variance would tend to underestimate $\sigma^2$

EX) Waiting times:

If we are thinking of the data as a population then find $\sigma^2$

If we are thinking of the data as a sample then find $s^2$

• Standard Deviation (SD)

Population standard deviation = $\sigma = \sqrt{\sigma^2}$

EX) Waiting times: find $\sigma$

Sample standard deviation = $s = \sqrt{s^2}$

EX) Waiting times: find 𝑠
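
Python's `statistics` module offers both versions of each measure: `pvariance` and `pstdev` divide by N (population), while `variance` and `stdev` divide by n – 1 (sample). A brief sketch for the waiting times:

```python
from statistics import pvariance, variance, pstdev, stdev

times = [6.5, 6.8, 7.0, 7.7]   # waiting times in minutes

sigma2 = pvariance(times)   # treats the data as a population (divide by N)
s2 = variance(times)        # treats the data as a sample (divide by n - 1)
sigma = pstdev(times)       # square root of sigma2
s = stdev(times)            # square root of s2

print(round(sigma2, 3), round(s2, 3), round(sigma, 3), round(s, 3))
```

Dividing by n – 1 always gives the larger of the two variances for the same data.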


Why do you think we might prefer $s$ to $s^2$?

Q) Find the mean and standard deviation for the samples:

a) x: 2, 5, 4, 7, 9

b) y: 20, 50, 40, 70, 90

c) w: 25, 55, 45, 75, 95

d) u: -25, -55, -45, -75, -95


FACT: If y = a + bx, where a, b are constants, then

$\bar{y} = a + b\bar{x}$

$s_y^2 = b^2 s_x^2$

$s_y = |b| s_x$
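
The FACT can be checked numerically on sample u from the question above, which is sample x transformed with a = –5 and b = –10 (those two constants are read off from the data, not stated in the notes). A minimal sketch:

```python
from statistics import mean, stdev, variance

x = [2, 5, 4, 7, 9]
a, b = -5, -10
u = [a + b * v for v in x]   # reproduces -25, -55, -45, -75, -95

# Compare each side of the three identities.
mean_pair = (mean(u), a + b * mean(x))
var_pair = (variance(u), b ** 2 * variance(x))
sd_pair = (stdev(u), abs(b) * stdev(x))

print(mean_pair, var_pair, sd_pair)
```

The absolute value matters here: b is negative, but a standard deviation cannot be.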

• Application of Standard Deviation

Empirical Rule: If X has bell-shaped distribution with mean = µ and standard deviation = σ then,

- approximately 68% of values of X fall within 1 standard deviation of the mean

- approximately 95% of values of X fall within 2 standard deviations of the mean

- approximately 99.7% of values of X fall within 3 standard deviations of the mean

Therefore, for bell-shaped distributions: $6\sigma \approx \text{range} \;\rightarrow\; \sigma \approx \dfrac{\text{range}}{6}$

EX) Suppose that adult IQ scores follow a bell-shaped distribution with mean µ=100 and standard
deviation σ=15. What values do 95% of all such scores fall between?
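
The empirical rule intervals for the IQ example are just µ ± kσ for k = 1, 2, 3. A small sketch, with the approximate percentages taken from the rule above:

```python
mu, sigma = 100, 15
coverage = {1: "68%", 2: "95%", 3: "99.7%"}   # approximate, bell-shaped data only

# Interval within k standard deviations of the mean.
intervals = {k: (mu - k * sigma, mu + k * sigma) for k in coverage}

for k, (lo, hi) in intervals.items():
    print(f"about {coverage[k]} of scores fall between {lo} and {hi}")
```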
• Fourth Spread, fs

fs = upper fourth – lower fourth

= Q3 – Q1

= IQR (Interquartile Range)

= range of the middle 50% of the data

The fourth spread is NOT sensitive to outliers. In fact, we use it to identify outliers!

• Outliers

An observation, x, is a mild outlier if:


x < Q1 – 1.5fs or x > Q3 + 1.5fs

An observation, x, is an extreme outlier if:

x < Q1 – 3fs or x > Q3 + 3fs

• Boxplot

Making Boxplots

* First find the five-number summary and fs

* Find the outlier cut points: Q1 – 3fs, Q1 – 1.5fs, Q3 + 1.5fs, Q3 + 3fs


EX) Given the sample, x: 4, 17, 20, 22, 26

Draw the boxplot
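
The numbers needed to draw this boxplot can be worked out as in the sketch below. It assumes the quartile rule used earlier in these notes (the middle value goes into both halves when n is odd) and assumes that whiskers extend to the most extreme observations that are not outliers; other boxplot conventions exist.

```python
from statistics import median

x = sorted([4, 17, 20, 22, 26])
n = len(x)
lower, upper = x[:n // 2 + n % 2], x[n // 2:]   # both halves include the middle value
q1, q2, q3 = median(lower), median(x), median(upper)
fs = q3 - q1                                    # fourth spread

mild_lo, mild_hi = q1 - 1.5 * fs, q3 + 1.5 * fs
extreme_lo, extreme_hi = q1 - 3 * fs, q3 + 3 * fs

def classify(value):
    if value < extreme_lo or value > extreme_hi:
        return "extreme outlier"
    if value < mild_lo or value > mild_hi:
        return "mild outlier"
    return "not an outlier"

labels = {v: classify(v) for v in x}
# Assumed convention: whiskers stop at the most extreme non-outlying values.
whiskers = (min(v for v in x if v >= mild_lo), max(v for v in x if v <= mild_hi))
print(q1, q2, q3, fs, labels, whiskers)
```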


Boxplot and Shape

* A distribution is symmetric if both the box and the whiskers are symmetric about Q2

* Longer whiskers imply a longer tail in the distribution

Q) A sample of 20 glass bottles of a particular type was selected and the internal pressure strength of each
bottle was determined. Consider the following partial sample information:

median = 202.2 lower fourth = 196.0 upper fourth = 216.8

Three smallest observations: 125.8 188.1 193.7

Three largest observations: 221.3 230.5 250.2

a) Are there any outliers in the sample? Any extreme outliers?

b) Construct a boxplot that shows outliers, and comment on any interesting features.
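
For part a), only the most extreme observations need to be checked against the outlier cut points. A possible sketch using the partial information given above (comparisons are used rather than exact equality, since the cut points are floating-point values):

```python
lower_fourth, upper_fourth = 196.0, 216.8
smallest = [125.8, 188.1, 193.7]
largest = [221.3, 230.5, 250.2]

fs = upper_fourth - lower_fourth   # fourth spread

def classify(x):
    if x < lower_fourth - 3 * fs or x > upper_fourth + 3 * fs:
        return "extreme outlier"
    if x < lower_fourth - 1.5 * fs or x > upper_fourth + 1.5 * fs:
        return "mild outlier"
    return "not an outlier"

results = {x: classify(x) for x in smallest + largest}
for x, label in results.items():
    print(x, label)
```

Observations between the three smallest and three largest cannot be outliers if their more extreme neighbours are not, which is why the partial information is enough.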
