
AP Stats Study Guide 1 1 1

Uploaded by

LLLL

Interpretations: (Always put in context)

Percentile - (percentile)% of [context] are less than or equal to (value).


Standard deviation

r - The linear association between x-context and y-context is weak/moderate/strong
(strength) and positive/negative (direction).

r^2 - About % of the variability in (y with units) is accounted for by the least squares
regression line with x = (x with units)

Standard deviation of residuals: (s) - The actual [y] is typically s away from the value
predicted by the LSRL

Slope: The predicted y increases/decreases by (slope) for each additional 1 unit of x.

P Value:
Assuming the null hypothesis is true, there is a (P-value) probability of getting a sample
statistic (e.g., a sample mean) as (large or larger/small or smaller/extreme or more extreme)
as the one observed, just by chance.

Confidence Interval: We are __% confident that the interval (lower, upper) contains the true
population parameter.

Confidence Level: If we take many samples of [context] and calculate an interval from each,
then about __% (the confidence level, e.g. 95%) of the intervals will contain the true
population parameter.

Standard error: If we take many, many samples, the sample slope will typically be about
[standard error] away from the true population slope.

Acronyms:
(Ch. 1) Describe Distribution in a HISTOGRAM: CUSS -
Center (median)
Unusual Points (outliers by 1.5*IQR rule)
Spread (most points lie between range of values, or say IQR)
Shape (uniform, bimodal, skewed left or right, symmetrical)
Describe association in a SCATTERPLOT: DUFS
Direction,
Unusual Features,
Form,
Strength.
Binomial: BINS - Binary, Independent, Number of trials fixed, Same probability for all trials
Geometric: BIS - Binary, Independent, Same probability for all trials (count trials until the first success).

Inferences:
Verify Conditions: SIN -
SRS/Random sample,
Independent: sample size n < 10% of population N.
Normal: np ≥ 10, n(1-p) ≥ 10
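
The two arithmetic conditions above can be checked quickly. This is only a sketch: the function name and the numbers (n = 100, N = 5000, p = 0.3) are made up for illustration, and the Random condition still has to be justified from the problem, not computed.

```python
def check_one_proportion_conditions(n, population_size, p):
    """Check the 10% condition and the large counts condition for a proportion."""
    return {
        # Independent: the sample is less than 10% of the population
        "independent_10_percent": n < 0.10 * population_size,
        # Normal: at least 10 expected successes and 10 expected failures
        "large_counts": n * p >= 10 and n * (1 - p) >= 10,
    }


# Hypothetical numbers, not from any problem in this guide
result = check_one_proportion_conditions(n=100, population_size=5000, p=0.3)
print(result)
```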

---------------------------------------------

Ch. 1 - Data Analysis


Ways to Display Information
Pie chart, segmented bar chart, mosaic chart, Histogram

There are more males than females because the bottom right box is bigger. However, females
were offered admission more.

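Measures of Center and Spread:
Resistant: Median, quartiles & IQR
Non-resistant: Mean, Standard Deviation, Variance, Range
Correlation is not affected by changes of units (adding a constant or multiplying by a
positive constant). Standard deviation and standard error are not affected by adding a
constant, but they are multiplied when the data are multiplied by a constant.

When a transformation lowers the mean, the standard deviation either stays the same
(a constant was subtracted) or decreases (the data were multiplied by a constant between 0 and 1).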

Transformations:
Affected by Addition and Multiplication (+, x): Mean, Median, Mode, Quartiles, Min, Max
Affected by Multiplication only (x): Standard Deviation, IQR, Range
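
A quick way to convince yourself of the transformation rules is to shift and scale a small data set and compare the summaries. The data values below are made up, and `statistics.quantiles` uses its own quartile convention, which may differ slightly from the calculator's.

```python
import statistics


def summary(data):
    """Return center and spread measures for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "sd": statistics.stdev(data),
        "iqr": q3 - q1,
    }


data = [2, 4, 4, 6, 9]                     # hypothetical data
original = summary(data)
shifted = summary([x + 10 for x in data])  # add 10 to every value
scaled = summary([x * 3 for x in data])    # multiply every value by 3

for name, stats in [("original", original), ("shifted", shifted), ("scaled", scaled)]:
    print(name, stats)
```

Adding 10 moves the mean and median but leaves range, SD, and IQR alone; multiplying by 3 changes all five.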
Standard Deviation:
Typical distance of values from the mean
High SD = flatter b/c values are farther away from the mean
Low SD = skinnier and higher in the center b/c values are closer to the mean

Outlier Rule: a value is an outlier if it is below 1st Quartile - 1.5(IQR) or above 3rd Quartile + 1.5(IQR)


IQR = 3rd Quartile - 1st Quartile, the range of the middle 50% of the data.
2 standard deviations rule: anything more than 2 sd from the mean is an outlier
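
The 1.5(IQR) rule can be sketched as below. The data set is invented, and the quartiles come from Python's `statistics.quantiles`, whose convention can differ a little from the calculator's 1-Var Stats, so fence values may not match exactly.

```python
import statistics


def iqr_outliers(data):
    """Return the values flagged as outliers by the 1.5*IQR rule."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data
outliers = iqr_outliers(values)
print(outliers)
```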

Right skewed: mean > median


Left skewed: mean < median

5 number summary: Minimum, 1st quartile, Median, 3rd quartile, Maximum


Use 1-Var Stats on the calculator to get these.

For the smallest standard deviation number set, any set of four identical numbers
will have sx = 0, so there are multiple correct answer choices.
For the largest standard deviation number set, there is only one possible answer. The
largest standard deviation will come from two values at each extreme.

Ch. 2 - Exploring 2 Var Data


Cumulative Frequency:
50% = median, 75% and 25% are quartiles
Steepest slope = most data points
Probability Plots:

The fastest 3% means the 97th percentile


Density curve area must be 1 and unifor

Checking for Associations

Marginal Relative Frequency


Conditional Relative Frequency
Or check with Chi Square test of Independence
P value < 0.05 = reject independence (evidence of an association)
P value > 0.05 = no convincing evidence of an association

Ch. 3 - Describing Relationships


Leverage - distance of a point's x-value from the mean of x
In a least squares regression line, the sum of the residuals = 0,
so the mean of the residuals = 0; the median may or may not be 0

The linear regression line minimizes the sum of squares of residuals.

If r^2 = 1, sum of squares of the residuals is 0.
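
These residual facts can be checked by fitting a line by hand. The sketch below uses the standard least-squares formulas on invented data; the variable names are illustrative only.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


xs = [1, 2, 3, 4, 5]                 # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = least_squares(xs, ys)

# residual = actual y - predicted y
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)          # sum of squared residuals
y_bar = sum(ys) / len(ys)
sst = sum((y - y_bar) ** 2 for y in ys)       # total variability in y
r_squared = 1 - sse / sst

print(sum(residuals), r_squared)
```

The residuals sum to zero (up to rounding), and r^2 gets closer to 1 as the sum of squared residuals shrinks toward 0.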

Correlation:
Units: NOT affected by unit changes (add/subtract a constant, multiply/divide by a positive constant).
Not affected by which variable is x or y, x or y does not need to be defined

Not resistant: affected by outliers/extreme values.

Correlation does NOT equal Causation, ONLY experiments can imply causation
R^2
The pattern in residual plot = non-linear
Random residual plot = linear

“Linear” means constant change.

Ch. 4 -Experimental Design/Collecting Data


retrospective - past
prospective - future

Good Experimental Design:


● Control:
- Benefit: Have a baseline to compare against
● Randomization
● Replication
within (have multiple subjects) vs outside the study (have different subjects in
another study)
● Blocking
- Benefit: We reduce the variability in the response caused by the blocking variable.
Random assignment is a good way to create two roughly equivalent groups.

Blocking: Split 100 people into blocks of 50 males and 50 females, then randomize treatments within each block
Benefit: We account for the variability due to the blocking variable (put in context).
Matched Pairs:
● 30 stores. Pair the 2 closest stores by location. Label one store as 1 and the other as 2.
Flip a coin, if the coin lands on heads, store 1 gets treatment A, and store 2 gets
treatment B. If the coin lands on tails, store 1 gets Treatment B and store 2 gets
Treatment A. Compare treatments between stores. Repeat for the remaining
pairs.

● 30 people.
For each pair of twins, label one person as twin A and label the other person as
twin B. For each pair of twins, toss a coin. If the coin lands on heads, twin A gets
the placebo and twin B gets the active drug. If the coin lands on tails, twin A gets
the active drug and twin B gets the placebo.

Benefit: Reduce variability, (put in context)
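
The coin-flip assignment within each pair can be simulated as below. This is a sketch under assumptions: the store labels and the seed are hypothetical, and `random.random() < 0.5` stands in for a fair coin landing on heads.

```python
import random


def assign_matched_pairs(pairs, seed=None):
    """Randomly give treatments A and B to the two members of each pair."""
    rng = random.Random(seed)
    assignment = {}
    for first, second in pairs:
        if rng.random() < 0.5:   # "heads": first gets A, second gets B
            assignment[first], assignment[second] = "A", "B"
        else:                    # "tails": first gets B, second gets A
            assignment[first], assignment[second] = "B", "A"
    return assignment


# Hypothetical pairs of neighboring stores
pairs = [("store 1", "store 2"), ("store 3", "store 4"), ("store 5", "store 6")]
assignment = assign_matched_pairs(pairs, seed=1)
print(assignment)
```

Every pair always ends up with one A and one B, so the treatments are compared within pairs.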


Placebo effect - ex. Where people respond to a treatment that has no active ingredient
because they believe it works.
Can introduce confounding variables (-)

Replication: Having enough people in

Double Blinding: Neither the subjects nor the people administering the treatment (or
measuring the response) are aware of who gets what treatment. Only someone who does not
interact with the subjects keeps track of the assignments.

Random sampling allows us to generalize the sample to the population


Random assignment allows us to establish causation (it balances out confounding variables) and
creates roughly equivalent groups.

Clustering

Less expensive , Less accurate/precise

Stratification
More accurate (+), More expensive (-)

Bias:
sampling method is biased if it produces estimates that are consistently smaller or
larger than the true value in the population.

Bias methods:
Voluntary response survey - letting people choose to take part (voluntary response bias); can also come with non-response/undercoverage

Selection Bias - anything wrong with the sampling method


Under-Coverage - people who aren’t included. EX. People without internet on an
internet survey
Non-Response - people who don’t respond/don’t want to
Nonrespondents may have different opinions than those who respond.

Confounding: When another variable is associated with the explanatory variable and also
affects the response variable, so their effects cannot be separated.

Ch. 5 - Probability

If A and B are independent (and both have nonzero probability), then they cannot be mutually exclusive


If A and B are mutually exclusive (and both have nonzero probability), then they cannot be independent

Independent means P(A) = P(A given B)


Mutually exclusive means P(A and B) = 0

General rule -> P(A or B) = P(A) + P(B) - P(A and B); if independent, P(A and B) = P(A) * P(B)


Mutually exclusive -> P(A or B) = P(A) + P(B)
If X and Y are independent: Var(X-Y) = Var(X) + Var(Y)
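
Both ideas can be checked exactly by listing every outcome of two fair dice. The dice setup and the events A and B are an illustration chosen here, not an example from the guide.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (x, y) of two independent fair dice
outcomes = list(product(range(1, 7), repeat=2))


def variance(values):
    """Population variance of a list of equally likely values."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / len(values)


def prob(event):
    """Probability that event(x, y) is true."""
    return Fraction(sum(1 for x, y in outcomes if event(x, y)), len(outcomes))


var_x = variance([x for x, _ in outcomes])
var_y = variance([y for _, y in outcomes])
var_diff = variance([x - y for x, y in outcomes])

p_a = prob(lambda x, y: x % 2 == 0)                 # A: first die is even
p_b = prob(lambda x, y: y == 6)                     # B: second die shows 6
p_a_and_b = prob(lambda x, y: x % 2 == 0 and y == 6)

print(var_x, var_y, var_diff)   # variances add even though we subtract
print(p_a_and_b, p_a * p_b)     # equal, so A and B are independent
```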

When it says “Is it unusual that…” find the probability of that value or more extreme (at that point or higher).
Don't just say “sampling variability”; always give a probability.

Disease problems (tree chart)

Ch. 6 - Random Variables


Always declare your random variable, X = ?
You cannot add standard deviations. For independent variables, add the variances: √ (s1^2 + s2^2)

binomial “at least” questions.


Probability of at least 4 successes
P(X >= 4) = 1 - P(X <= 3)
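
The complement trick mirrors 1 - binomcdf(n, p, 3) on the calculator. A minimal sketch, assuming made-up values n = 10 and p = 0.3:

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k) for a binomial random variable."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


n, p = 10, 0.3                        # hypothetical values
at_least_4 = 1 - binom_cdf(3, n, p)   # P(X >= 4) = 1 - P(X <= 3)
direct = sum(binom_pmf(k, n, p) for k in range(4, n + 1))

print(at_least_4, direct)
```

Both routes give the same probability; the complement just needs fewer terms.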

geometric “at least” questions.


(MCQ) Combining Random Variables and Probability
“Probability of one group being greater than other at any time”

P(sean - evan > 0) Subtract the means; add the variances (not the standard deviations)

Mean of difference: 225 - 240 = -15

SD of difference: √(25^2 + 15^2) = 29.155

P(sean - evan > 0) = normalcdf(0, 10^99, -15, 29.155) = 0.303


Answer: D
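
The same calculation can be reproduced with Python's `statistics.NormalDist`, using the means (225, 240) and standard deviations (25, 15) from the work above and assuming the two variables are independent and normal.

```python
from math import sqrt
from statistics import NormalDist

mean_diff = 225 - 240                 # subtract the means
sd_diff = sqrt(25 ** 2 + 15 ** 2)     # add variances, then take the square root

difference = NormalDist(mu=mean_diff, sigma=sd_diff)
p_sean_greater = 1 - difference.cdf(0)    # P(sean - evan > 0)

print(round(sd_diff, 3), round(p_sean_greater, 3))
```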

(MCQ) Conditional probability with normalcdf

top 20% is the 80th percentile. invNorm(0.8, 80, 7) = 85.89


P(certificate of merit | hrs worked < 90)
= P(certificate of merit AND hrs worked < 90) / P(hrs worked < 90)
= P(85.89 < X < 90) / P(X < 90)

P(X < 90) = normalcdf(-10^99, 90, 80, 7) = 0.923
P(85.89 < X < 90) = normalcdf(85.89, 90, 80, 7) = 0.123

0.123 / 0.923 = 0.133

Answer: C
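
The steps can be checked with `statistics.NormalDist`, using the mean 80 and SD 7 from the work above. Because nothing is rounded along the way, the final value may differ from the hand calculation in the third decimal place.

```python
from statistics import NormalDist

hours = NormalDist(mu=80, sigma=7)

cutoff = hours.inv_cdf(0.80)          # top 20% starts at the 80th percentile
p_under_90 = hours.cdf(90)            # P(X < 90)
p_merit_and_under_90 = p_under_90 - hours.cdf(cutoff)   # P(cutoff < X < 90)

# Conditional probability: P(merit AND under 90) / P(under 90)
p_merit_given_under_90 = p_merit_and_under_90 / p_under_90

print(round(cutoff, 2), round(p_under_90, 3), p_merit_given_under_90)
```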

Ch. 7 - Sampling Distributions


Statistic vs Parameter

Ch. 8 - Confidence Intervals

when the question says ‘estimate’, it refers to an interval

Relationships
↑ n , ↓ confidence interval width and margin of error

Doubling the sample size n divides the confidence interval width by √2.

As z* ↓, C-Level ↓, Margin of Error ↓


as standard deviation ↓, Margin of Error ↓

½ (Margin of Error) needs 4n (inverse square)


2 (Margin of Error) needs only ¼ n

2n sample size = 1/sqrt(2) interval length


3n sample size = 1/sqrt(3) interval length

½ n sample size = sqrt(2) interval length
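
All of these relationships come from the margin of error being proportional to 1/sqrt(n). A small sketch with made-up values (z* = 1.96, SD = 10, n = 100):

```python
from math import sqrt


def margin_of_error(z_star, sd, n):
    """Margin of error for a mean: z* times sd over the square root of n."""
    return z_star * sd / sqrt(n)


base = margin_of_error(1.96, 10, 100)      # hypothetical starting point
quadrupled = margin_of_error(1.96, 10, 400)
doubled = margin_of_error(1.96, 10, 200)
halved = margin_of_error(1.96, 10, 50)

print(quadrupled / base)   # 4n  -> half the margin of error
print(doubled / base)      # 2n  -> 1/sqrt(2) of the margin of error
print(halved / base)       # n/2 -> sqrt(2) times the margin of error
```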

To be more confident of capturing the true value, the interval becomes wider.

Drawback of larger samples: take more time & money to get

if the confidence level is the same, risk of being incorrect is the same
ME only accounts for sampling variability by chance NOT BIAS

MCQ Practice:
“Minimum sample size for margin of error”

z* √(0.60(1−0.60)/900) ≤ 0.027
-> z* (0.0163) ≤ 0.027 -> z* ≤ 1.653
normalcdf(-10^99, 1.653, 0, 1) = 0.95 (area to the left of z*; this leaves about 0.05 in each tail)
Answer: C
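
The arithmetic in this problem can be reproduced as below, using the 0.60, 900, and 0.027 from the work above. The last line converts the area to the left of z* into a central area, which is an extra step added here for illustration.

```python
from math import sqrt
from statistics import NormalDist

p_hat, n, max_margin = 0.60, 900, 0.027

standard_error = sqrt(p_hat * (1 - p_hat) / n)
z_star = max_margin / standard_error          # largest z* that keeps ME <= 0.027
area_left = NormalDist().cdf(z_star)          # same as normalcdf(-10^99, z*, 0, 1)
central_area = 1 - 2 * (1 - area_left)        # area between -z* and z*

print(round(standard_error, 4), round(z_star, 3))
print(round(area_left, 3), round(central_area, 3))
```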
Relationships:

Pick any standard deviation (here, 100).


100/sqrt(9000) = 1.054

100/sqrt(1000) = 3.162, which is 3 times wider.


B is the answer

Ch. 9 - Testing a Claim

P-value < significance level: reject the null hypothesis: EVIDENCE


P-value > significance level: fail to reject the null hypothesis: NO EVIDENCE

Decision | Null True | Null False
Reject | Type I Error (False Positive, severe) | Power (correct decision)
Fail to Reject | Correct decision | Type II Error (False Negative)

As Power increases, Type II decreases.

A two sided test only allows us to reject (or fail to reject) a hypothesized value for a
particular population parameter.

P(Type I error) = significance level


Power = 1 - P(Type II Error)

You cannot find power from only a type 1 error, or vice versa.

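Z test - Proportions (or means when you know the sd of the POPULATION)


T-test - Means (population sd unknown)

Power: Probability of rejecting the null hypothesis when it is false (a specific alternative is true).


Type I Error - Assuming the null is true, rejecting it. P(Type I Error) = alpha level
Type II Error - Assuming the null is false, failing to reject it.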

The Random condition is the most important condition.

Increase Power of a significance test by


● Increasing sample size n
● Increasing significance level
● When the true (alternative) value is farther away from the null hypothesis value.
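
These three effects can be seen in a simple power calculation. The sketch assumes a one-sided z test for a mean (Ha: mu > mu0) with a known population SD, which is a simplification chosen here; all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(mu0, mu_true, sigma, n, alpha):
    """Power of a one-sided z test (Ha: mu > mu0) when the true mean is mu_true."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)                   # rejection cutoff under the null
    shift = (mu_true - mu0) / (sigma / sqrt(n))     # how far the truth is, in SEs
    return 1 - z.cdf(z_crit - shift)


base = power_one_sided_z(mu0=100, mu_true=103, sigma=10, n=25, alpha=0.05)
bigger_n = power_one_sided_z(100, 103, 10, n=100, alpha=0.05)
bigger_alpha = power_one_sided_z(100, 103, 10, n=25, alpha=0.10)
farther_truth = power_one_sided_z(100, 106, 10, n=25, alpha=0.05)

print(base, bigger_n, bigger_alpha, farther_truth)
```

Each of the last three values is larger than the first, matching the list above.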

We don’t ever ‘prove’ anything in a significance test


Z Test vs T Test -
When using the sample standard deviation to estimate the population standard
deviation, more variability is introduced into the sampling distribution of the
statistic, so use t instead of z.

T test -
● t-distributions are symmetric
● they are lower at the mean and higher at the tails and so are more spread out
than the normal distribution.
● The greater the df, the closer the t-distributions are to the normal distribution.
● The 68-95-99.7 Rule applies to the z-distribution and will work for t-models with
very large df.
● All probability density curves have an area of 1 below them.
Conditions (Normal/Large Sample): population normally distributed, or n >= 30, or sample data roughly symmetric with no strong skewness or outliers.

MCQS: “Type I/II interpretation”

Ch. 10 - Compare 2 Populations

Pooling:
2 Proportion Z Intervals DO NOT use pooling
2 Proportion Z Tests DO use pooling
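
The difference shows up in the standard error. A sketch with invented counts (45 of 100 versus 30 of 100); the function name is illustrative only.

```python
from math import sqrt


def two_proportion_standard_errors(x1, n1, x2, n2):
    """Return (unpooled SE for the interval, pooled SE for the test)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)    # combined proportion, assumes p1 = p2 under Ho
    se_interval = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    se_test = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return se_interval, se_test


se_interval, se_test = two_proportion_standard_errors(45, 100, 30, 100)
z_statistic = (45 / 100 - 30 / 100) / se_test   # test statistic uses the pooled SE

print(se_interval, se_test, z_statistic)
```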

Ch. 11 - Chi Square


Conditions:
SRS: given or quote
Independence: n < .10 of population
Large Counts: all expected values/counts >= 5
Test for homogeneity - compares the distribution of a single categorical variable for two or more
populations or treatments. Two (or more) samples
A test for independence looks for an association between two categorical
variables in a single population. One sample

11.1)
Goodness of Fit Test:
Ho: The distribution of [context] in the population is the same as the claimed distribution (e.g., uniform)
Ha: The distribution of [context] in the population is NOT the same as the claimed distribution

11.2)
Homogeneity Test:
Ho: The distribution of [context] is the same for all the populations
Ha: The distribution of [context] is NOT the same for all the populations

Independence Test:
Ho: There is not an association between [context] and [context] in population
Ha: There is an association between [context] and [context] in population

Do: Put observed values in a matrix (2nd x^-1 -> Edit), then Stat -> Tests -> x^2-Test.

Degrees of freedom for Homogeneity and Independence = (rows − 1)(columns − 1)


Degrees of freedom for GOF = # categories - 1
Chi-squared test statistic: sum of (Observed - Expected)^2 / Expected over all cells
Expected count for a matrix cell = (row total * col total) / grand total
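
The expected counts, the test statistic, and the degrees of freedom can be computed directly from a two-way table. The observed counts below are made up for illustration.

```python
def chi_square_from_table(observed):
    """Return (expected counts, chi-square statistic, df) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # expected = (row total * column total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # statistic = sum of (observed - expected)^2 / expected over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return expected, statistic, df


observed = [[20, 30], [30, 20]]     # hypothetical 2 x 2 table
expected, statistic, df = chi_square_from_table(observed)
print(expected, statistic, df)
```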

Calculator programs(ti-84):
Goodness of fit test: x2GOF-Test( Stat -> Tests -> x2GOF-Test )
x2GOF-Test(L1,L2) with observed counts in L1 and expected counts in L2
Homogeneity: x2-Test( Stat -> Tests - > x2-Test )
x2-Test {A,B}
Independence: x2-Test( Stat -> Tests - > x2-Test )
x2-Test {A,B}
(Enter the observed values into matrix A and the x2-Test will do the rest of your calculations. The
values in matrix B are inputted by the calculator when you finish the test into {B} ).

Clear matrix - 2nd + -> Mem Management -> Matrix -> Del A or B
Relationships:
The more degrees of freedom a chi-square distribution has, the HIGHER the mean will be.
Chi-square distribution with greater than 10 degrees of freedom is roughly symmetric.

These questions ask for expected value. (row total)(column total) / (grand total)
When a question asks for the contribution to the x^2 statistic, just do (O-E)^2 / E

Ch 12 - Slope
12.1
The formula for the confidence interval for the slope is b1 ± t* · (standard error of the slope)
df = n-2

Linear - residual plot shows no curved pattern


Independent - n<0.10N
Normal - histogram of residuals shows no strong outliers/skewness
Equal SD - residual plot has roughly equivalent standard dev
Random - random

12.2
To undo a log transformation, raise 10 to the predicted value: 10^log(y) = y (do this for x as well if both sides were logged)

“Confidence Interval Slope”

b ± t value * standard error

Standard error = 16.258; it will always be under SE Coef, in the same row as the slope.
b = -145.569

df = n-2, df = 23.
invT(0.975, 23) = 2.069
-145.57 ± 2.069*16.258 = (-179.21, -111.93) Answer is C
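
The interval arithmetic can be checked as below. The slope, standard error, and t* are the values from the work above; t* is typed in from invT because the Python standard library has no t-distribution function.

```python
def slope_confidence_interval(b, standard_error, t_star):
    """Return (lower, upper) for b plus or minus t* times the standard error."""
    margin = t_star * standard_error
    return b - margin, b + margin


# b and SE from the regression output, t* from invT(0.975, 23)
lower, upper = slope_confidence_interval(b=-145.569, standard_error=16.258, t_star=2.069)
print(round(lower, 2), round(upper, 2))
```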
AP Tips:

1. When asked to describe a one-variable data set, always discuss shape, center, and
spread in context. That means your answer should mention the variable and include
units.

2. If you are asked to compare distributions, use phrases such as greater than, less
than, and the same as. And, again, always answer in context.

3. Understand how skewness can be used to differentiate between the mean and the
median.

4. Know how transformations of a data set affect summary statistics.

5. Be careful when using “normal” as an adjective. Normal refers to a specific model,
not the general shape of a graph of a data set. It’s better to use “mound-shaped and
symmetric,” etc., instead. You will be docked for saying something like, “The shape of
the data set is normal.” No data set is exactly normal. At least, call it “approximately
normal.”

6. Remember that a correlation does not necessarily imply a causal relationship
between two variables. Conversely, the absence of a strong correlation does not mean
there is no relationship (it might not be linear).

7. Be able to use a residual plot to help determine if a linear model for a data set is
appropriate. Be able to explain your reasoning.

8. Recognize that the correlation coefficient (r) measures the strength and direction of a
relation we have reason to believe is linear. The correlation coefficient does NOT tell us
that the linear model is an appropriate model.

9. Be able to interpret, in context, the slope and y-intercept of a least-squares
regression line. Be sure to include “predicted” or “tends to” in your description.

10. Be able to read computer regression output.


11. Know the definition of a simple random sample (SRS).

12. Know the definition of, and reasons for, choosing to do a stratified random sample
instead of a simple random sample.

13. Be able to design an experiment using a completely randomized design.
Understand that an experiment that utilizes blocking cannot, by definition, be a
completely randomized design.

14. Explain the difference between the purposes of randomization and blocking.

15. Be able to describe what blinding and confounding variables are.

16. Clearly describe how to create a simulation for a probability problem.

17. Be clear on the distinction between independent events and mutually exclusive
events (and why mutually exclusive events can’t be independent).

18. Be able to find the mean and standard deviation of a discrete random variable.

19. Recognize binomial and geometric situations.

20. Never forget that hypotheses are always about parameters, never about statistics.

21. Any hypothesis testing procedure involves four steps. Know what they are and that
they must always be there. And never forget that your conclusion in context (Step 4)
must be linked to your calculations (Step 3) in some way.

22. When doing inference problems, remember that you must show that the conditions
for the inference procedure are present. It is not sufficient to simply declare them
present. Realize that you are often not instructed to check the conditions in the question
but you must do so anyway.

23. Be clear on the concepts of Type I and Type II errors and the power of a test.

24. If you are required to construct a confidence interval, remember that there are three
things you must do to receive full credit: justify that the conditions necessary to
construct the interval are present; construct the interval; and interpret the interval in
context. You’ll need to remember this, because often the only instruction you will see is
to construct the interval.

25. If you include graphs as part of your solution, be sure that axes are labeled and that
scales are clearly indicated. This is part of communication.
