CS210 Statistics Notes PDF
Data Basics
All variables
  Numerical
    Continuous
    Discrete
  Categorical
    Categorical (unordered)
    Ordinal
Associated vs. Independent
-When two variables show some connection with one another, they are called associated/dependent variables.
-If two variables are not associated, i.e. there is no evident connection between the two, then they are said to be independent.
*Anecdotal Evidence: “My uncle smokes three packs a day and he’s in perfectly good health”
Sampling Bias
Non-Response: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no
longer be representative of the population.
Voluntary Response: Occurs when the sample consists of people who volunteer to respond because they have strong
opinions on the issue. Such a sample will also not be representative of the population.
Convenience Sample: Individuals who are easily accessible are more likely to be included in the sample.
This study source was downloaded by 100000859810404 from CourseHero.com on 01-04-2024 09:51:09 GMT -06:00
https://www.coursehero.com/file/187263415/CS210-Statistics-Notespdf/
Explanatory variable —(might affect)—> Response variable
Observational Study: Researchers collect data in a way that does not directly interfere with how the data arise, i.e. they
merely “observe”, and can only establish an association between the explanatory and response variables.
Experiment: Researchers randomly assign subjects to various treatments in order to establish causal connections between
the explanatory and response variables.
//CORRELATION DOES NOT IMPLY CAUSATION//
—>Main difference between observational studies and experiments:
Most experiments use random assignment while observational studies do not.
3 Possible Explanations
1-Eating breakfast causes girls to be thinner.
2-Being thin causes girls to eat breakfast.
3-A third variable is responsible for both. What could it be?
—>Extraneous variables that affect both the explanatory and the response variable, and that make it seem like there is a
relationship between the two, are called confounding variables.
Retrospective Study: Collect data after events have taken place.
Obtaining Good Samples
Almost all statistical methods are based on the notion of implied randomness.
If observational data are not collected in a random framework from a population, these statistical methods - the estimates and
errors associated with the estimates - are not reliable.
-Simple Random Sample: Randomly select cases from the population, where there is no implied connection between the points that
are selected.
-Stratified Sample: Strata are made up of similar observations. We take a simple random sample from each stratum. (Fig1)
-Cluster Sample: Clusters are usually not made up of homogeneous observations. We take a simple random sample of clusters, and
then sample all observations in that cluster. Usually preferred for economical reasons.
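The three sampling schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 case IDs, the four groups (reused as both strata and clusters), and the function names are all made up here and do not come from the notes.

```python
import random

def simple_random_sample(population, k, rng):
    """Pick k cases at random, without replacement."""
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Take a simple random sample of size k_per_stratum from each stratum."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick n_clusters clusters, then keep every observation in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(100))     # hypothetical case IDs 0..99
# Hypothetical groups: four blocks of 25 IDs, reused as strata and as clusters.
groups = {f"g{i}": population[25 * i: 25 * (i + 1)] for i in range(4)}

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(groups, 3, rng)
clust = cluster_sample(groups, 2, rng)
print(len(srs), {k: len(v) for k, v in strat.items()}, sorted(clust))
```

Note the difference in what gets randomized: the stratified sample draws a few cases from every group, while the cluster sample draws a few whole groups and keeps everything inside them.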
Principles of Experimental Design
1-Control: Compare treatment of interest to a control group.
2-Randomize: Randomly assign subjects to treatments, and randomly
sample from the population whenever possible.
3-Replicate: Within a study, replicate by collecting a sufficiently large
sample. Or replicate the entire study.
4-Block: If there are variables that are known or suspected to affect the response variable, first group subjects into
blocks based on these variables, and then randomize cases within each block to treatment groups.
Ex. We would like to design an experiment to investigate if energy gels make you run faster. It’s suspected that
energy gels might affect pro and amateur athletes differently, therefore we block for pro status.
Another ex. A study is designed to test the effect of light level and noise level on exam performance of students. The
researcher also believes that light and noise levels might have different effects on males and females, so wants to
make sure both genders are equally represented in each group.
Explanatory Variable(s): Light level & Noise Level
Response Variable(s): Exam Performance
Blocking Variable(s): Gender
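Blocking followed by randomization can be sketched as below for the energy gel example. The runner names, the block sizes, and the `block_randomize` helper are hypothetical; the point is only that shuffling happens inside each block, so every block is split evenly between treatments.

```python
import random

def block_randomize(subjects, block_of, treatments, rng):
    """Group subjects into blocks, then randomly assign treatments within each block."""
    blocks = {}
    for subject in subjects:
        blocks.setdefault(block_of(subject), []).append(subject)
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)                      # random order within the block
        for i, subject in enumerate(shuffled):
            # Cycle through the treatments so each block is split evenly.
            assignment[subject] = treatments[i % len(treatments)]
    return assignment

rng = random.Random(1)
# Hypothetical subjects: 6 pro and 6 amateur runners, stored as (name, status).
runners = [(f"runner{i}", "pro" if i < 6 else "amateur") for i in range(12)]
assignment = block_randomize(runners, lambda r: r[1], ["gel", "no gel"], rng)

for status in ("pro", "amateur"):
    n_gel = sum(1 for r, t in assignment.items() if r[1] == status and t == "gel")
    print(status, "gel:", n_gel, "no gel:", 6 - n_gel)
```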
Mean: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$ (sample mean)
The population mean is computed the same way but is denoted as $\mu$ (mu). It is often not possible to calculate $\mu$ since the
population data are rarely available.
The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate may not be
perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.
Standard Deviation: $s = \sqrt{s^2}$ is the square root of the variance $s^2$, and has the same units as the data.
Median is the value that splits the data in half when ordered in ascending order.
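A small worked example may help tie the mean, variance, standard deviation, and median together. The data values are invented for illustration, and the variance uses the usual sample definition $s^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2$, which these notes do not spell out.

```python
import math
import statistics

data = [2, 4, 4, 5, 7, 8]   # hypothetical sample, made up for illustration

n = len(data)
mean = sum(data) / n                                      # sample mean, x-bar
variance = sum((x - mean) ** 2 for x in data) / (n - 1)   # s^2, with n - 1 in the denominator
sd = math.sqrt(variance)                                  # s, in the same units as the data

ordered = sorted(data)
mid = n // 2
# Odd n: the middle value. Even n: the average of the two middle values.
median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

print(mean, variance, sd, median)
```

The hand-rolled values agree with `statistics.mean`, `statistics.variance`, `statistics.stdev`, and `statistics.median` from the standard library.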
Outliers
Why is it important to look for outliers?
-Identify extreme skew in the distribution.
-Identify data collection and entry errors.
-Provide insight into interesting features of the data.
Robust Statistics
Median and IQR are more robust to skewness and outliers than mean and SD. Therefore:
-For skewed distributions it is often more helpful to use the median and IQR to describe the center and spread.
-For symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread.
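To see the robustness claim in action, the sketch below replaces the largest value of a small made-up data set with an extreme outlier and compares the summaries. The data and the `iqr` helper are assumptions for illustration; quartile conventions differ between textbooks and software, and this uses the default method of `statistics.quantiles`.

```python
import statistics

def iqr(values):
    """Interquartile range, Q3 - Q1, using the default quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

clean = [10, 12, 13, 14, 15, 16, 18]   # hypothetical data
with_outlier = clean[:-1] + [180]      # largest value replaced by an extreme one

for label, values in (("clean", clean), ("with outlier", with_outlier)):
    print(label,
          "mean:", round(statistics.mean(values), 2),
          "SD:", round(statistics.stdev(values), 2),
          "median:", statistics.median(values),
          "IQR:", iqr(values))
```

The mean and SD move substantially once the outlier is present, while the median and IQR stay where they were.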
***When data are extremely skewed, transforming them might make modeling easier. A common transformation is the log
transformation.
Pros & Cons of Transformations: Skewed data are easier to model once they are transformed because outliers tend
to become far less prominent after an appropriate transformation.
However, results of an analysis might be difficult to interpret because the log of a measured variable often has no natural interpretation in the original units.
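As a rough illustration of what a log transformation does to skewed data, consider the invented values below, which span several orders of magnitude. The base-10 log is an arbitrary choice here; any base would show the same effect.

```python
import math
import statistics

values = [1, 10, 100, 1_000, 10_000]        # hypothetical, strongly right-skewed
logged = [math.log10(x) for x in values]    # log10 transformation

print("raw:    mean", statistics.mean(values), "median", statistics.median(values))
print("logged: mean", statistics.mean(logged), "median", statistics.median(logged))
```

On the raw scale the mean is pulled far above the median by the largest values; on the log scale the two coincide because the transformed values are evenly spaced.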
***A table that summarizes data for two categorical variables is called a contingency table.
OS02: Probability
A random process is a situation in which we know what outcomes could happen, but we don’t know which particular outcome
will happen.
P(A)=Probability of event A
0<=P(A)<=1
—>Frequentist interpretation:
The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an
infinite number of times.
—>Bayesian interpretation:
A Bayesian interprets probability as a subjective degree of belief: For the same event, two separate people could have
different viewpoints and so assign different probabilities.
Largely popularized by revolutionary advances in computational technology and methods during the last twenty years.
Law of Large Numbers: As more observations are collected, the proportion of occurrences with a particular outcome, $\hat{p}_n$,
converges to the probability of that outcome, $p$.
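The Law of Large Numbers is easy to watch in a simulation. The sketch below assumes a fair coin (p = 0.5) and a fixed random seed; both are choices made for this example, and `running_proportion` is a hypothetical helper name.

```python
import random

def running_proportion(n_trials, p, rng):
    """Return the observed proportion of successes after each of n_trials trials."""
    successes = 0
    proportions = []
    for n in range(1, n_trials + 1):
        successes += rng.random() < p      # True counts as 1
        proportions.append(successes / n)
    return proportions

rng = random.Random(42)
p = 0.5                                    # assumed true probability of heads
p_hat = running_proportion(100_000, p, rng)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(p_hat[n - 1], 4))
```

The early proportions can sit well away from p, while the later ones settle close to it. Nothing forces short runs to "balance out"; the early deviations are simply swamped as n grows.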
Law of Averages (Gambler’s Fallacy): A common misunderstanding of the LLN is that random processes are supposed
to compensate for whatever happened in the past; this is just not true.
Independence: Two processes are independent if knowing the outcome of one provides no useful information about the
outcome of the other.
—>Checking for independence:
If P(A occurs, given that B is true) = P(A|B) = P(A), then A and B are independent.
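The check P(A|B) = P(A) can be carried out directly on a table of counts. The counts below are hypothetical and were chosen so that the two events come out independent; exact fractions are used to avoid floating-point comparisons.

```python
from fractions import Fraction

# Hypothetical counts, keyed by (A occurred, B occurred).
counts = {
    (True, True): 30,
    (True, False): 70,
    (False, True): 60,
    (False, False): 140,
}

total = sum(counts.values())
n_a = counts[(True, True)] + counts[(True, False)]    # cases where A occurred
n_b = counts[(True, True)] + counts[(False, True)]    # cases where B occurred

p_a = Fraction(n_a, total)                            # P(A)
p_a_given_b = Fraction(counts[(True, True)], n_b)     # P(A|B): restrict to cases where B occurred

independent = p_a_given_b == p_a
print(p_a, p_a_given_b, independent)
```

With real sample data the two probabilities will rarely match exactly, so an exact equality test like this one is only appropriate for known (population) probabilities.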
*Determining Dependence Based On Sample Data*
If conditional probabilities calculated based on sample data suggest dependence between two variables, the next step is to
conduct a hypothesis test to determine if the observed difference between the probabilities is likely or unlikely to have
happened by chance.
If the observed difference between the conditional probabilities is large, then there is stronger evidence that the difference is
real.
If a sample is large, then even a small difference can provide strong evidence of a real difference.
Bayes’ Theorem
$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$
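Bayes' theorem states that P(A|B) = P(B|A) P(A) / P(B). A short numeric sketch follows, using a screening-test scenario that is entirely made up for illustration (1% prevalence, 90% true-positive rate, 5% false-positive rate); the `bayes_posterior` helper is also hypothetical.

```python
def bayes_posterior(prior, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A), and P(B|not A)."""
    # Total probability of B: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    p_b = p_b_given_a * prior + p_b_given_not_a * (1 - prior)
    return p_b_given_a * prior / p_b

# Hypothetical numbers: A = "has the condition", B = "tests positive".
posterior = bayes_posterior(prior=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
print(round(posterior, 4))
```

Even with a fairly accurate test, the posterior stays modest here because the condition is rare: most positive results come from the much larger group without it.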
Random Variables:
A random variable is a numeric quantity whose value depends on the outcome of a random event. $P(X = x_i)$ denotes the probability that $X$ takes the value $x_i$.
—Discrete Random Variables: Often take only integer values.
—Continuous Random Variables: Take real(decimal) values
Expectation
We are often interested in the average outcome of a random variable. We call this the expected value (mean), and it is a
weighted average of the possible outcomes: $\mu = E(X) = \sum_i x_i \, P(X = x_i)$
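As a quick check of the weighted-average idea, the expected value of a fair six-sided die can be computed directly. The die is an assumed example, and `expected_value` is a hypothetical helper.

```python
from fractions import Fraction

def expected_value(distribution):
    """Sum of x_i * P(X = x_i) over a dict mapping outcomes to probabilities."""
    assert sum(distribution.values()) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in distribution.items())

# Fair die: each face 1..6 has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
mu = expected_value(die)
print(mu, float(mu))
```

The expected value does not have to be a possible outcome: no face of the die shows 3.5.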
OS03: Distributions of Random Variables
Normal Distribution
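Unimodal and symmetric, bell shaped curve
Many variables are nearly normal, but none are exactly normal
Denoted as $N(\mu, \sigma)$: normal with mean $\mu$ and standard deviation $\sigma$
[Figure: normal curve centered at $\mu$, with the horizontal axis marked in multiples of $\sigma$]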
Standardizing with Z scores
Standardized score or Z score of an observation is the number of standard deviations it falls above or below the mean.
$Z = \frac{\text{observation} - \text{mean}}{\text{SD}}$
Z scores are defined for distributions of any shape, but only when the distribution is normal can we use Z scores to calculate
percentiles.
Observations that are more than 2SD away from the mean (|Z| > 2) are usually considered unusual.
Percentile
Percentile is the percentage of observations that fall below a given data point.
Graphically, percentile is the area below the probability distribution curve to the left of that observation.
>>One can use Z-table.
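Instead of a Z-table, the percentile of an observation under a normal model can be computed with Python's standard library. The sketch assumes a hypothetical N(mean = 1500, SD = 300) distribution and an observation of 1800; those numbers are not from the notes.

```python
from statistics import NormalDist

def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies above (+) or below (-) the mean."""
    return (observation - mean) / sd

mean, sd = 1500, 300     # hypothetical normal model
observation = 1800

z = z_score(observation, mean, sd)
percentile = NormalDist(mean, sd).cdf(observation)   # area to the left of the observation
unusual = abs(z) > 2                                 # the |Z| > 2 rule of thumb

print(z, round(percentile, 4), unusual)
```

Looking up the Z score in the standard normal, `NormalDist().cdf(z)`, gives the same area, which is exactly what a Z-table tabulates.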
Six Sigma:
“The term six sigma process comes from the notion that if one has six standard deviations between the process mean and the
nearest specification limit, no items will fail to meet specifications.”
Sampling Distribution
A sampling distribution is a probability distribution of a statistic obtained from a large number of samples drawn from a
specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different
outcomes that could possibly occur for a statistic of a population.
Central Limit Theorem (CLT): $\bar{x} \sim N\left(\text{mean} = \mu,\ SE = \frac{\sigma}{\sqrt{n}}\right)$
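If the population standard deviation $\sigma$ is unknown, use the sample standard deviation $s$ to compute the SE.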
Certain conditions must be met for the CLT to apply. In general, n should be > 30 and, when sampling without replacement, < 10% of the population.
1–Independence: Sampled observations must be independent. This is difficult to verify, but is more likely if
random sampling/assignment is used and, if sampling without replacement, n < 10% of the population.
2–Sample size/skew: Either the population distribution is normal, or if the population distribution is skewed, the sample size is
large.
-The more skewed the population distribution, the larger the sample size we need for the CLT to apply.
-For moderately skewed distributions, n > 30 is a widely used rule of thumb.
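A simulation can make the CLT conditions concrete. The sketch below assumes a right-skewed population (exponential with mean 1 and SD 1), a sample size of n = 50, and 2,000 repeated samples; all of these are illustrative choices, not values from the notes.

```python
import math
import random
import statistics

rng = random.Random(2024)
n = 50                # sample size, above the n > 30 rule of thumb
n_samples = 2_000     # number of repeated samples

# Hypothetical skewed population: exponential with rate 1, so mu = 1 and sigma = 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

center = statistics.fmean(sample_means)      # should be close to mu = 1
spread = statistics.stdev(sample_means)      # should be close to sigma / sqrt(n)
theoretical_se = 1.0 / math.sqrt(n)

print(round(center, 3), round(spread, 3), round(theoretical_se, 3))
```

Although individual observations are strongly skewed, the sample means cluster around the population mean with a spread close to the standard error predicted by the CLT.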
Confidence Intervals: A plausible range of values for the population parameter is called a confidence interval.
The approximate 95% confidence interval is defined as (point estimate) ± 2 × SE, where SE = $s/\sqrt{n}$
Confidence interval, a general formula
point estimate ± z* × SE
—>Width of an interval: If we want to be more certain that we capture the population parameter, i.e. increase our confidence
level, we should use a wider interval.
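The general formula, point estimate ± z* × SE with SE = s/√n, can be sketched as follows. The sample is simulated purely for illustration (40 draws from a hypothetical normal population), and `confidence_interval` is a made-up helper name; z* is taken from the standard normal distribution.

```python
import math
import random
import statistics
from statistics import NormalDist

def confidence_interval(sample, level=0.95):
    """Return (lower, upper) for the population mean using point estimate +/- z* x SE."""
    n = len(sample)
    point_estimate = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)       # SE = s / sqrt(n)
    z_star = NormalDist().inv_cdf(0.5 + level / 2)     # roughly 1.96 for a 95% level
    return point_estimate - z_star * se, point_estimate + z_star * se

rng = random.Random(7)
# Hypothetical sample of n = 40 observations.
sample = [rng.gauss(100, 15) for _ in range(40)]

low95, high95 = confidence_interval(sample, 0.95)
low99, high99 = confidence_interval(sample, 0.99)
print("95%:", round(low95, 2), round(high95, 2))
print("99%:", round(low99, 2), round(high99, 2))
```

Raising the confidence level from 95% to 99% increases z*, so the 99% interval is wider than the 95% interval computed from the same sample.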