Statistics Midterm Review
1. Stem-and-leaf plot:
Frequency   Stem   Leaf
    2         0    7 9
   24         1    0 0 0 0 1 1 1 2 3 3 3 4 4 5 6 6 6 6 7 7 8 8 9 9
   11         2    0 0 1 1 2 3 4 6 6 7 8
    4         3    1 7 7 8
    1         4    2
    2         5    0 9
Total: 44
Use:
• To reveal essential data features (central tendency, dispersion)
For the example above:
o Central tendency: 24 of the 44 P/E ratios were in the 10-19 stem
o Dispersion: the range is from 7 to 59
• To retrieve the raw data by concatenating a stem digit with each of its leaf digits
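The retrieval step can be sketched in Python. This is only an illustration of the idea, not part of the course material; the names stem_leaf and raw_data are chosen here for the example.

```python
# Stem-and-leaf display from the example above: stem digit -> list of leaf digits.
stem_leaf = {
    0: [7, 9],
    1: [0, 0, 0, 0, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9, 9],
    2: [0, 0, 1, 1, 2, 3, 4, 6, 6, 7, 8],
    3: [1, 7, 7, 8],
    4: [2],
    5: [0, 9],
}

# Concatenating a stem digit with a leaf digit equals stem * 10 + leaf.
raw_data = [stem * 10 + leaf
            for stem, leaves in stem_leaf.items()
            for leaf in leaves]

print(len(raw_data))                 # number of P/E ratios recovered
print(min(raw_data), max(raw_data))  # the range quoted under "Dispersion"
```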
2. Dot plots: Reveal dispersion, central tendency, and the shape of the distribution
Interpretation:
• The range is from 7 to 59.
• All but a few data values lie between 10 and 25.
• A typical “middle” data value would be around 17 or 18.
• The data are not symmetric due to a few large P/E ratios.
3. Frequency distribution:
Step 1: Choose the number of bins based on sample size (n)
Step 2: Construct the frequency distribution table
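The two steps can be sketched in Python using the 44 P/E ratios read off the stem-and-leaf plot in item 1. The review does not say which rule it uses for Step 1, so Sturges' rule (k = 1 + 3.3 log10(n)) is an assumption here, and the bin width of 10 is a hand-picked "nice" value rather than something prescribed by the notes.

```python
import math

# P/E ratios recovered from the stem-and-leaf plot in item 1 (44 values).
data = [7, 9,
        10, 10, 10, 10, 11, 11, 11, 12, 13, 13, 13, 14, 14, 15,
        16, 16, 16, 16, 17, 17, 18, 18, 19, 19,
        20, 20, 21, 21, 22, 23, 24, 26, 26, 27, 28,
        31, 37, 37, 38, 42, 50, 59]
n = len(data)

# Step 1 (assumption): Sturges' rule suggests a starting number of bins.
k_sturges = 1 + 3.3 * math.log10(n)
print(f"Sturges' rule suggests about {k_sturges:.1f} bins")

# Step 2: count observations in 6 bins of width 10 that cover 0 to under 60.
bin_width = 10
num_bins = 6
counts = [0] * num_bins
for x in data:
    counts[x // bin_width] += 1

for i, c in enumerate(counts):
    lower = i * bin_width
    upper = lower + bin_width
    print(f"{lower:>2} to under {upper:<2}  frequency = {c:>2}  relative = {c / n:.3f}")
```

With a width of 10 the bin counts coincide with the stem frequencies, which is a quick way to check the table.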
5. Scatter plot:
The relationship between X and Y:
Example: Use Stata to make a scatter plot from the data file GPA1.dta, with hsGPA on the X-axis
and colGPA on the Y -axis. Describe the relationship (if any) between X and Y. Weak? Strong?
Negative? Positive? Linear? Nonlinear?
Answer: Weak positive relationship between high school GPA and MSU GPA.
b. Median (M): the 50th percentile, or midpoint, of the n sorted sample values
• If n is odd, the median is the middle observation.
Example:
1 2 4 6 7 9 11 15 20
Median = 7 (the 5th of the 9 sorted values)
• If n is even, the median is the average of the middle two observations.
Example:
1 2 3 5 7 8 11 15 17 19
$\text{Median} = \frac{7 + 8}{2} = 7.5$
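Both cases of the median rule can be checked with a short Python sketch. The function name median is chosen here for illustration; Python's standard statistics.median gives the same results.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                 # odd n: the single middle observation
    return (s[mid - 1] + s[mid]) / 2  # even n: mean of the two middle observations

odd_example = [1, 2, 4, 6, 7, 9, 11, 15, 20]
even_example = [1, 2, 3, 5, 7, 8, 11, 15, 17, 19]
print(median(odd_example))
print(median(even_example))
```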
c. Mode: The most frequently occurring data value
• May have multiple modes or no mode
• Most useful for discrete data
Example:
Anna’s scores: 50 50 50 65 65 70 => Mode = 50 (occurs 3 times)
Ben’s scores: 40 50 60 70 75 90 => Mode = none
Charlie’s scores: 60 60 80 90 90 => Mode = 60, 90
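The three score examples can be reproduced with a small Python sketch. The function name modes and the convention of returning an empty list when every value occurs exactly once are choices made here for illustration.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

anna = [50, 50, 50, 65, 65, 70]
ben = [40, 50, 60, 70, 75, 90]
charlie = [60, 60, 80, 90, 90]

print(modes(anna))     # one mode
print(modes(ben))      # no mode
print(modes(charlie))  # two modes
```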
d. Shape: The shape of the distribution is judged by comparing the mean and median.
e. Geometric mean (G): used when all the data values are positive (> 0).
$G = \sqrt[n]{x_1 x_2 \cdots x_n}$
Example: The geometric mean for X = 2, 3, 7, 9, 10, 12 is:
$G = \sqrt[6]{(2)(3)(7)(9)(10)(12)} = \sqrt[6]{45{,}360} = 5.972$
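The geometric mean example can be verified with a few lines of Python. This is a minimal sketch; the function name geometric_mean is chosen here, and it simply takes the n-th root of the product.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean requires all values to be positive")
    return math.prod(values) ** (1 / len(values))

x = [2, 3, 7, 9, 10, 12]
print(math.prod(x))                 # product under the root
print(round(geometric_mean(x), 3))  # geometric mean, rounded to 3 decimals
```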
f. Growth rates: The average growth rate for a time series
$GR = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$
Example:
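A minimal Python sketch of the growth-rate formula follows. The series 100, 110, 121 is hypothetical, made up here for illustration and not taken from the course material; it grows by 10% per period, so the formula should return 0.10.

```python
def average_growth_rate(series):
    """Average growth rate: (x_n / x_1) ** (1 / (n - 1)) - 1."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    if series[0] <= 0 or series[-1] <= 0:
        raise ValueError("first and last values must be positive")
    return (series[-1] / series[0]) ** (1 / (n - 1)) - 1

hypothetical_sales = [100, 110, 121]  # hypothetical data, n = 3 periods
gr = average_growth_rate(hypothetical_sales)
print(f"{gr:.2%}")
```

Only the first and last observations enter the formula; the values in between do not affect the result.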
3. Standardized data:
a. Calculate the standardized value (z-score):
For a population:
$z = \frac{x - \mu}{\sigma}$
For a sample:
$z = \frac{x - \bar{x}}{s} \quad (n \ge 30)$
b. Two rules:

Chebyshev’s Theorem
For any data set, no matter how it is distributed, the percentage of observations that lie within k standard deviations of the mean (i.e., within 𝜇 ± 𝑘𝜎) must be at least $100\left[1 - \frac{1}{k^2}\right]$.
• 𝑘 = 2: at least 75.0% will lie within 𝜇 ± 2𝜎
• 𝑘 = 3: at least 88.9% will lie within 𝜇 ± 3𝜎
• 𝑘 = 4: at least 93.8% will lie within 𝜇 ± 4𝜎
Examples:
1. For an exam with 𝜇 = 72 and 𝜎 = 8, what is the interval that at least 75% of the scores will be within?
Answer: 72 ± 2(8) or [56, 88]
2. Suppose 400 students take an exam. At least how many students will have scores within [120, 200] if 𝜇 = 160 and 𝜎 = 10?
Answer:
200 = 160 + 𝑘(10) → 𝑘 = 4
→ At least 93.8% of the 400 students, that is, 375 students, will have scores within [120, 200].

Empirical Rule
For data from a normal distribution, we expect the interval 𝜇 ± 𝑘𝜎 to contain a known percentage of the data:
• 𝑘 = 1: 68.26% will lie within 𝜇 ± 𝜎
• 𝑘 = 2: 95.44% will lie within 𝜇 ± 2𝜎
• 𝑘 = 3: 99.73% will lie within 𝜇 ± 3𝜎
Note: Data values outside 𝜇 ± 3𝜎 are called outliers.
• Unusual: 2 < |𝑧| ≤ 3
• Outlier: |𝑧| > 3
Example: For a price survey with 𝜇 = 100 and 𝜎 = 25, the price of product A is 150. Find its standardized value. Is it an outlier?
Answer:
$z = \frac{x - \mu}{\sigma} = \frac{150 - 100}{25} = 2$
⇒ The price of product A is not an outlier.
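The worked examples for standardized data and Chebyshev's bound can be reproduced with a short Python sketch. The function names are chosen here for illustration, and the label "typical" for |z| ≤ 2 is an assumption of this sketch, since the notes only name the "unusual" and "outlier" ranges.

```python
def z_score(x, mean, sd):
    """Standardized value: distance from the mean in standard deviations."""
    return (x - mean) / sd

def chebyshev_min_percent(k):
    """Minimum percentage of data within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)

def classify(z):
    """Apply the |z| cutoffs from the notes; 'typical' is a label assumed here."""
    if abs(z) > 3:
        return "outlier"
    if abs(z) > 2:
        return "unusual"
    return "typical"

# Exam example: [120, 200] with mean 160 and standard deviation 10.
k = z_score(200, 160, 10)
min_students = 400 * chebyshev_min_percent(k) / 100
print(k, chebyshev_min_percent(k), min_students)

# Price survey example: x = 150, mean 100, standard deviation 25.
z_price = z_score(150, 100, 25)
print(z_price, classify(z_price))
```

Note that the exact bound for k = 4 is 93.75%, which the notes round to 93.8%; the student count of 375 uses the unrounded value.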
4. Percentiles, quartiles, and box plots:
a. Percentiles: The value below which a percentage of data falls
Exercises:
Given the data set:
30 50 45 50 40 80 70
Find the percentage of data that falls below the value of 45 (the percentile of
value 45). Find the value of the 66th percentile.
1. Given value => calculate percentile
Step 1: Sort the data given
Position Value
1 30
2 40
3 45
4 50
5 50
6 70
7 80
Step 2: Find the percentage of data that falls below the value of 45
$\frac{2}{7} \times 100 = 28.57\%$
Thus, the value of 45 is the 29th percentile in the data set.
Source: https://www.m7athsisfun.com/data/percentiles.html
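The "given value, find percentile" calculation can be sketched in Python as follows. The function name percent_below is chosen here for illustration; it counts values strictly below the given value, which is how the exercise arrives at 2 out of 7.

```python
def percent_below(values, x):
    """Percentage of observations strictly less than x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100

data = [30, 50, 45, 50, 40, 80, 70]
pct = percent_below(data, 45)
print(f"{pct:.2f}% of the data fall below 45")
```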
2. Given percentile => calculate value: Excel’s quartile interpolation method (*)
Step 1: Find the position
$\frac{(n + 1)p}{100}$
with n: the number of values in the sample
p: the percentile given
The 66th percentile position: $\frac{(7 + 1)(66)}{100} = 5.28$ ⇒ it lies between positions 5 and 6
Step 2: Find the values
The value of the 66th percentile: 50 + 0.28(70 − 50) = 55.6
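The two steps above can be sketched in Python. This is an illustrative implementation of the (n + 1)p/100 position rule with linear interpolation; the function name percentile_value is chosen here, and clamping to the smallest or largest value when the position falls outside 1..n is an assumption of this sketch that the notes do not cover.

```python
import math

def percentile_value(values, p):
    """Value at the p-th percentile using position (n + 1) * p / 100."""
    s = sorted(values)
    n = len(s)
    position = (n + 1) * p / 100   # 1-based position in the sorted data
    lower = math.floor(position)   # position of the value just below
    fraction = position - lower    # how far to move toward the next value
    if lower < 1:
        return s[0]                # assumption: clamp below the first value
    if lower >= n:
        return s[-1]               # assumption: clamp above the last value
    return s[lower - 1] + fraction * (s[lower] - s[lower - 1])

data = [30, 50, 45, 50, 40, 80, 70]
p66 = percentile_value(data, 66)
print(round(p66, 2))
```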
b. Quartiles: Scale points that divide the sorted data into 4 groups of approximately
equal size
Note:
• 25th percentile = the first quartile = lower quartile = Q1
• 50th percentile = the second quartile = median = Q2
• 75th percentile = the third quartile = upper quartile = Q3
Interquartile range: measures the degree of spread in the data (the middle 50%)
𝑰𝑸𝑹 = 𝑸𝟑 − 𝑸𝟏
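As a quick illustration, the quartiles and IQR of the percentile exercise data (30 50 45 50 40 80 70) can be computed in Python. The sketch relies on statistics.quantiles with method="exclusive", which, as far as the standard library documentation describes it, places the cut points at positions (n + 1)p/100 and interpolates between neighbors; treating that as equivalent to the interpolation method in these notes is an assumption of the example.

```python
import statistics

data = [30, 50, 45, 50, 40, 80, 70]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3)  # first quartile, median, third quartile
print(iqr)         # spread of the middle 50% of the data
```

For this data set the positions (7 + 1)(25)/100 = 2, 4, and 6 are whole numbers, so no interpolation is needed and the quartiles are simply the 2nd, 4th, and 6th sorted values.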
How to find Q1, Q2, Q3? Use the method of medians or Excel’s quartile
interpolation method.
1. Method of medians:
So, for the example in Figure 4.25, there are neither outliers nor extreme outliers.
−1 ≤ 𝑟 ≤ +1
• 𝑟 ≈ 0: There is little or no linear relationship between X and Y.
• 𝑟 near +1: Strong positive relationship between X and Y.
• r near -1: Strong negative relationship between X and Y.
b. Covariance: measures the degree to which the values of X and Y change together
For example, for the prices of two stocks X and Y:
• Move in the same direction: $\sigma_{XY} > 0$
• Move in opposite directions: $\sigma_{XY} < 0$
• The prices of X and Y are unrelated: $\sigma_{XY} = 0$
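The sign behavior of the covariance can be illustrated with a short Python sketch. The stock prices below are hypothetical, invented for this example, and the function computes the sample covariance (dividing by n − 1) rather than the population covariance; the sign is the same either way.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

# Hypothetical daily prices, not taken from the course material.
stock_x = [10, 11, 12, 13]
moves_with_x = [20, 22, 23, 25]     # rises when X rises
moves_against_x = [25, 23, 22, 20]  # falls when X rises

print(sample_covariance(stock_x, moves_with_x))     # positive
print(sample_covariance(stock_x, moves_against_x))  # negative
```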
Example: Your laptop gets warm (even hot) when you place it on your lap because it
is dissipating heat from its microprocessor and related components. Calculate the
correlation coefficient.
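Answer:
X: Microprocessor Speed
Y: Power Dissipation
$\bar{x} = \frac{\sum_{i=1}^{14} x_i}{14} = 1703.79$
$\bar{y} = \frac{\sum_{i=1}^{14} y_i}{14} = 70$
$r = \frac{\sum_{i=1}^{14} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{14} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{14} (y_i - \bar{y})^2}} = \frac{802505}{\sqrt{25194288}\,\sqrt{27624}} = 0.962$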
Use your calculator to find r as follows (for the Casio fx-580VNX):
Menu → 6: Statistics → 2: y=a+bx → Enter all the data given for X and Y → AC →
OPTN → 3: Regression calculation → r = 0.962
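The same calculation can be sketched in Python. The 14 data pairs are not listed in this review, so the sketch plugs the three sums reported in the answer into the formula; the general function pearson_r is an illustrative helper named here, usable when the raw X and Y values are available.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient from raw paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

# Sums of cross-products and squared deviations reported in the answer above.
sum_xy = 802505
sum_xx = 25194288
sum_yy = 27624

r_from_sums = sum_xy / (math.sqrt(sum_xx) * math.sqrt(sum_yy))
print(round(r_from_sums, 3))
```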
6. Skewness and Kurtosis:
a. Skewness:
b. Kurtosis: The relative length of the tails and the degree of concentration in the
center
h. Independent events:
Event A is independent of event B ⇔ $P(A \mid B) = P(A)$
If events A and B are independent, then $P(A \cap B) = P(A)P(B)$
If events $A_1, A_2, \ldots, A_n$ are independent, $P(A_1 \cap A_2 \cap \ldots \cap A_n) = P(A_1)P(A_2) \ldots P(A_n)$
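Both forms of the independence condition can be checked on a small example in Python. The fair six-sided die and the events A (even roll) and B (roll of 1 or 2) are an illustration chosen here, not an example from the course material; exact fractions avoid rounding issues.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (illustrative)
A = {2, 4, 6}                      # event A: the roll is even
B = {1, 2}                         # event B: the roll is 1 or 2

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_a = prob(A)
p_b = prob(B)
p_a_and_b = prob(A & B)
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b == p_a)      # first condition: P(A|B) = P(A)
print(p_a_and_b == p_a * p_b)  # second condition: P(A and B) = P(A)P(B)
```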
3. Contingency tables:
a. Marginal probabilities:
The joint probability that the school has low tuition (T1) and has large salary gains (S3):
$P(T_1 \cap S_3) = \frac{1}{67} = 0.0149$
⇒ There is less than a 2 percent chance that a top-tier school has both low tuition and
high salary gains.
c. Conditional probabilities:
The conditional probability that salary gains are small (S1) given that the MBA
tuition is large (T3):
$P(S_1 \mid T_3) = \frac{5}{32} = 0.1563$
⇒ There is about a 16 percent chance that a top-tier school’s salary gains will be small
despite its high tuition.
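The joint and conditional probabilities above can be reproduced from the cell counts with a short Python sketch. Only the four counts that appear in the two fractions are used; the full contingency table is not reproduced in this review, and reading 32 as the number of high-tuition (T3) schools out of the same 67 is an assumption of this sketch.

```python
from fractions import Fraction

total_schools = 67    # denominator of the joint probability
count_t1_and_s3 = 1   # low tuition and large salary gains
count_s1_and_t3 = 5   # small salary gains and high tuition
count_t3 = 32         # high-tuition schools (assumed to be the T3 total)

# Joint probability: cell count over the grand total.
p_t1_and_s3 = Fraction(count_t1_and_s3, total_schools)

# Conditional probability: cell count over the total for the given condition.
p_s1_given_t3 = Fraction(count_s1_and_t3, count_t3)

# Same result via the definition P(S1|T3) = P(S1 and T3) / P(T3).
p_t3 = Fraction(count_t3, total_schools)
p_s1_and_t3 = Fraction(count_s1_and_t3, total_schools)

print(float(p_t1_and_s3))          # about 0.0149
print(float(p_s1_given_t3))        # 0.15625, reported as 0.1563 in the notes
print(p_s1_and_t3 / p_t3 == p_s1_given_t3)
```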
4. Bayes’ theorem:
$P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A)} = \frac{P(A \mid B)P(B)}{P(A \mid B)P(B) + P(A \mid B')P(B')}$
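A minimal Python sketch of the expanded form of Bayes' theorem follows, with the denominator built from B and its complement. The probabilities used (a prior of 0.01 for B, 0.90 for A given B, and 0.05 for A given not-B) are hypothetical values chosen for illustration, and the function name bayes_posterior is likewise an assumption of this example.

```python
def bayes_posterior(p_a_given_b, p_b, p_a_given_not_b):
    """P(B|A) using the total probability of A in the denominator."""
    p_not_b = 1 - p_b
    p_a = p_a_given_b * p_b + p_a_given_not_b * p_not_b  # total probability of A
    return p_a_given_b * p_b / p_a

# Hypothetical inputs, not taken from the course material.
prior_b = 0.01
likelihood_a_given_b = 0.90
likelihood_a_given_not_b = 0.05

posterior = bayes_posterior(likelihood_a_given_b, prior_b, likelihood_a_given_not_b)
print(round(posterior, 4))
```

Even with a high P(A|B), the posterior stays small here because B is rare to begin with, which is the usual point of working through the denominator term by term.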