CS210 Statistics Notes

Suppose there was a medical study whose outcomes showed that almost 70% of the treatment group and 19% of the control group showed good results.
—>Do the data show a "real" difference between the groups?
The observed difference between the two groups might be real, or it may be due to natural variation.
Since the difference is quite large, it is more believable that the difference is real.
We need statistical tools to determine if the difference is so large that we should reject the notion that it was due to chance.

Data Basics

All variables are either categorical or numerical.
-Categorical variables can be regular categorical or ordinal (ordered categories).
-Numerical variables can be discrete or continuous.

Associated vs. Independent
-When two variables show some connection with one another, they are called associated/dependent variables.
-If two variables are not associated, i.e. there is no evident connection between the two, then they are said to be independent.

Overview of Data Collection Principles


Research Question —> Population of Interest —> Sample —> Population to which results can be generalized

*Anecdotal Evidence: “My uncle smokes three packs a day and he’s in perfectly good health”

*Census: Sampling the entire population.
—>There are problems with it: it can be difficult to complete a census, populations rarely stand still, and taking a census may be more complex than sampling.

Exploratory Analysis to Inference


Sampling is natural.
Think about sampling something you are cooking - you taste (examine) a small part of what you're cooking to get an idea about the dish as a whole.
When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's exploratory analysis.
If you generalize and conclude that your entire soup needs salt, that's an inference.
For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).
If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, then what you tasted is probably not representative of the whole pot.

Sampling Bias
Non-Response: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.
Voluntary Response: Occurs when the sample consists of people who volunteer to respond because they have strong
opinions on the issue. Such a sample will also not be representative of the population.
Convenience Sample: Individuals who are easily accessible are more likely to be included in the sample.

Explanatory variable —(might affect)—> Response variable

Observational Study: Researchers collect data in a way that does not directly interfere with how the data arise, ie. they
merely “observe”, and can only establish an association between the explanatory and response variables.
Experiment: Researchers randomly assign subjects to various treatments in order to establish causal connections between
the explanatory and response variables.
//CORRELATION DOES NOT IMPLY CAUSATION//
—>Main difference between observational studies and experiments:
Most experiments use random assignment while observational studies do not.

Observational Studies and Sampling Strategies


“New study sponsored by General Mills says that eating breakfast makes girls thinner.”
This is an observational study.
The conclusion is that there is an association between girls eating breakfast and being slimmer.
The study is sponsored by General Mills.

3 Possible Explanations
1-Eating breakfast causes girls to be thinner.
2-Being thin causes girls to eat breakfast.
3-A third variable is responsible for both. What could it be?
—>An extraneous variable that affects both the explanatory and the response variable, and that makes it seem like there is a relationship between the two, is called a confounding variable.

Prospective Study: Identifies individuals and collects information as events unfold.
Retrospective Study: Collects data after events have taken place.
Obtaining Good Samples
Almost all statistical methods are based on the notion of implied randomness.
If observational data are not collected in a random framework from a population, these statistical methods - the estimates and
errors associated with the estimates - are not reliable.

Most commonly used random sampling techniques:

-Simple Random Sample: Randomly select cases from the population, where there is no implied connection between the points that are selected.
-Stratified Sample: Strata are made up of similar observations. We take a simple random sample from each stratum.
-Cluster Sample: Clusters are usually not made up of homogeneous observations. We take a simple random sample of clusters, and then sample all observations in that cluster. Usually preferred for economical reasons.
-Multistage Sample: We take a simple random sample of clusters, and then take a simple random sample of observations from the sampled clusters.

Difference between Stratified and Cluster Sampling:
Stratified sampling divides a population into groups, then includes some members of all of the groups.
Cluster sampling divides a population into groups, then includes all members of some randomly chosen groups.
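The four strategies can be sketched in a few lines of Python. This is only a toy illustration: the population of 1,000 people and the 5 city groups below are invented, as are all the sample sizes.

import random

random.seed(0)
# Invented population: 1000 people, each tagged with one of 5 cities.
population = [(f"person{i}", f"city{i % 5}") for i in range(1000)]

# Simple random sample: every case is equally likely to be selected.
srs = random.sample(population, 50)

# Stratified sample: group by city (the strata), then SRS within each stratum.
strata = {}
for person, city in population:
    strata.setdefault(city, []).append((person, city))
stratified = [p for group in strata.values() for p in random.sample(group, 10)]

# Cluster sample: SRS of cities (the clusters), then keep everyone in them.
chosen_clusters = random.sample(list(strata), 2)
cluster = [p for c in chosen_clusters for p in strata[c]]

# Multistage sample: SRS of clusters, then SRS of observations within each.
multistage = [p for c in chosen_clusters for p in random.sample(strata[c], 20)]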

Principles of Experimental Design
1-Control: Compare treatment of interest to a control group.
2-Randomize: Randomly assign subjects to treatments, and randomly
sample from the population whenever possible.
3-Replicate: Within a study, replicate by collecting a sufficiently large
sample. Or replicate the entire study.
4-Block: If there are variables that are known or suspected to affect the response variable, first group subjects into
blocks based on these variables, and then randomize cases within each block to treatment groups.
Ex. We would like to design an experiment to investigate if energy gels make you run faster. It is suspected that
energy gels might affect pro and amateur athletes differently, therefore we block for pro status.

Another ex. A study is designed to test the effect of light level and noise level on exam performance of students. The
researcher also believes that light and noise levels might have different effects on males and females, so wants to
make sure both genders are equally represented in each group.
Explanatory Variable(s): Light level & Noise Level
Response Variable(s): Exam Performance
Blocking Variable(s): Gender

Difference Between Blocking and Explanatory Variables


—Factors are conditions we can impose on the experimental units.
—Blocking variables are characteristics that the experimental units come with, that we would like to control for.
—Blocking is like stratifying, except used in experimental settings when randomly assigning, as opposed to when sampling.

More experimental design terminology:


—Placebo: Fake treatment, often used as the control group for medical studies.
—Placebo effect: Experimental units showing improvement because they believe they are receiving special treatment.
—Blinding: When experimental units do not know whether they are in the control or treatment group.
—Double-blind: When both the experimental units and the researchers who interact with the patients do not know who is in the control and who is in the treatment group.

Examining Numerical Data:


—>Data Visualization Stuff

Sample mean: x̄ = (x1 + x2 + ... + xn) / n

The population mean is also computed the same way but is denoted as μ. It is often not possible to calculate μ since the population data are rarely available.
The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.

Commonly Observed Shapes of Distributions


Unimodal(Single Prominent Peak), Bimodal/Multimodal(2+ prominent peaks), Uniform(no apparent peaks)
Skewness(Right/Left/Symmetric)

Variance: s^2 = Σ(xi - x̄)^2 / (n - 1) is roughly the average squared deviation from the mean.

Why do we use the squared deviation in the calculation of variance?


->To get rid of negatives so that the observations equally distant from the mean are weighed equally.
->To weigh larger deviations more heavily.

Standard Deviation s = (s^2)^1/2 is the square root of the variance, and has the same units as the data.
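A minimal check of these formulas in Python, on a small made-up data set; the hand-rolled values should match the standard library's statistics module.

from statistics import mean, stdev, variance

data = [3, 5, 5, 9]  # made-up values
xbar = sum(data) / len(data)                               # sample mean: 5.5
s2 = sum((x - xbar) ** 2 for x in data) / (len(data) - 1)  # variance: 19/3
s = s2 ** 0.5                                              # standard deviation
print(xbar == mean(data), abs(s2 - variance(data)) < 1e-9, abs(s - stdev(data)) < 1e-9)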

Median is the value that splits the data in half when ordered in ascending order.

Q1, Q3 and IQR


The 25th percentile is also called the first quartile, Q1.
The 50th percentile is also called the median.
The 75th percentile is also called the third quartile, Q3.
Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.
—>IQR = Q3 - Q1
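A short sketch of these definitions in Python, with made-up data. Note that software packages interpolate percentiles slightly differently; the "inclusive" method below is one common choice.

from statistics import median, quantiles

data = [1, 3, 5, 6, 7, 8, 9, 10, 14]  # made-up, already sorted
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q2 == median(data))  # True: the 50th percentile is the median (7)
print(q3 - q1)             # IQR = Q3 - Q1 = 9 - 5 = 4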

Outliers
Why is it important to look for outliers?
-Identify extreme skew in the distribution.
-Identify data collection and entry errors.
-Provide insight into interesting features of the data.

Robust Statistics
Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,
-for skewed distributions it is often more helpful to use the median and IQR to describe the center and spread;
-for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread.

Mean vs. Median


If the distribution is symmetric, center is often defined as the mean (mean ≈ median)
If the distribution is skewed or has extreme outliers, center is often defined as the median
Right-skewed: mean>median
Left-skewed: mean<median

***When data are extremely skewed, transforming them might make modeling easier. A common transformation is the log
transformation.
Pros & Cons of Transformations: Skewed data are easier to model with when they are transformed because outliers tend
to become far less prominent after an appropriate transformation.
However, results of an analysis might be difficult to interpret because the log of a measured variable is usually meaningless.
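A tiny illustration with invented numbers: one extreme value dominates the raw data, but becomes far less prominent after a log transformation.

import math

data = [1, 2, 2, 3, 4, 5, 1000]  # right-skewed: one huge outlier
logged = [math.log(x) for x in data]
print(max(data) - min(data))                 # raw range: 999
print(round(max(logged) - min(logged), 2))   # log range: ~6.91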

***A table that summarizes data for two categorical variables is called a contingency table.

Hypothesis Testing Framework:


—We start with a null hypothesis (H0) that represents the status quo.
—We also have an alternative hypothesis (HA) that represents our research question, i.e. what we're testing for.
—We conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation or theoretical
methods.
—If the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, we stick with the
null hypothesis. If they do, then we reject the null hypothesis in favor of the alternative.
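A minimal simulation-based sketch of this framework, applied to the treatment/control example at the top of these notes. The group sizes below are invented for illustration. Under H0 the group labels are interchangeable, so we shuffle them repeatedly and count how often chance alone produces a difference at least as large as the observed one.

import random

random.seed(1)
treatment = [1] * 35 + [0] * 15  # hypothetical: 70% success out of 50
control = [1] * 9 + [0] * 36     # hypothetical: 20% success out of 45

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

pooled = treatment + control
sims, extreme = 10_000, 0
for _ in range(sims):
    random.shuffle(pooled)           # reassign group labels at random (H0)
    sim_t = pooled[:len(treatment)]
    sim_c = pooled[len(treatment):]
    if sum(sim_t) / len(sim_t) - sum(sim_c) / len(sim_c) >= observed:
        extreme += 1

# A tiny simulated p-value means the observed gap is very unlikely under H0,
# so we would reject the null hypothesis in favor of the alternative.
print(observed, extreme / sims)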

OS02: Probability
A random process is a situation in which we know what outcomes could happen, but we don’t know which particular outcome
will happen.
P(A)=Probability of event A
0<=P(A)<=1

—>Frequentist interpretation:
The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an
infinite number of times.
—>Bayesian interpretation:
A Bayesian interprets probability as a subjective degree of belief: For the same event, two separate people could have
different viewpoints and so assign different probabilities.
Largely popularized by revolutionary advances in computational technology and methods during the last twenty years.

Law of Large Numbers: As more observations are collected, the proportion of occurrences with a particular outcome, p̂n, converges to the probability of that outcome, p.
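A quick sketch of the LLN with simulated fair-coin flips (p = 0.5): the proportion of heads drifts toward p as the number of flips grows.

import random

random.seed(7)
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)  # approaches 0.5 as n grows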

Law of Averages(Gambler’s Fallacy): The common misunderstanding of the LLN is that random processes are supposed
to compensate for whatever happened in the past; this is just not true.

—Disjoint(Mutually Exclusive) Outcomes: Cannot happen at the same time.


—Non-Disjoint Outcomes: Can happen at the same time.

General Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)

Probability Distribution: A probability distribution lists all possible


events and the probabilities with which they occur.

Sample Space: The collection of all possible outcomes of a trial.


Ex. Sample Space of the gender of one kid = S = {M,F}
Sample Space of the genders of two kids = S = {MM, FF, MF, FM}
Complementary Events: Two mutually exclusive events whose probabilities add up to 1.
Ex. A couple has two kids, if we know that they are not both girls what are the possible gender combinations for these kids?
{MM, FM, MF}

Independence: Two processes are independent if knowing the outcome of one provides no useful information about the
outcome of the other.
—>Checking for independence:
If P(A occurs, given that B is true) = P(A|B) = P(A), then A and B are independent.

*Determining Dependence Based On Sample Data*
If conditional probabilities calculated based on sample data suggest dependence between two variables, the next step is to
conduct a hypothesis test to determine if the observed difference between the probabilities is likely or unlikely to have
happened by chance.
If the observed difference between the conditional probabilities is large, then there is stronger evidence that the difference is
real.
If a sample is large, then even a small difference can provide strong evidence of a real difference.

Product Rule For Independent Events: P(A and B) = P(A) x P(B)

Q: Does the sum of the probabilities of two disjoint events always add up to 1?
A: Not necessarily; there may be more than 2 events in the sample space, e.g. party affiliation.

Q: Does the sum of the probabilities of two complementary events always add up to 1?
A: Yes, that's the definition of complementary, e.g. heads and tails.

Conditional Probability: P(A|B) = P(A and B)/P(B)

Independence and conditional probabilities


Generically, if P(A|B) = P(A) then the events A and B are said to be independent.
-Conceptually: Knowing B doesn't tell us anything about A.
-Mathematically: We know that if events A and B are independent, P(A and B) = P(A) x P(B). Then,
P(A|B) = P(A and B)/P(B) = P(A) x P(B)/P(B) = P(A)

Bayes' Theorem

P(A|B) = P(B|A) x P(A) / P(B)
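A worked sketch of Bayes' Theorem with invented numbers, in the classic diagnostic-testing setting: A = has the condition, B = tests positive.

p_a = 0.01              # P(A): assumed prevalence of the condition
p_b_given_a = 0.95      # P(B|A): assumed true positive rate
p_b_given_not_a = 0.05  # P(B|not A): assumed false positive rate

# Total probability of a positive test: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b  # Bayes' Theorem
print(round(p_a_given_b, 3))  # ~0.161: even given a positive test, A stays unlikely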
Random Variables:
A random variable is a numeric quantity whose value depends on the outcome of a random event. P(X = x)
—Discrete Random Variables: Often take only integer values.
—Continuous Random Variables: Take real (decimal) values.

Expectation
We are often interested in the average outcome of a random variable. We call this the expected value (mean), and it is a weighted average of the possible outcomes: μ = E(X) = Σ xi P(X = xi)
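A one-line sketch of this weighted average, for a hypothetical raffle ticket with invented payouts and probabilities:

outcomes = [0, 5, 100]      # possible winnings (invented)
probs = [0.94, 0.05, 0.01]  # P(X = xi); must sum to 1
mu = sum(x * p for x, p in zip(outcomes, probs))
print(mu)  # E(X) = 0*0.94 + 5*0.05 + 100*0.01 = 1.25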
OS03: Distributions of Random Variables

Normal Distribution

Unimodal and symmetric, bell shaped curve
Many variables are nearly normal, but none are exactly normal
Denoted as N(μ, σ) —>Normal with mean μ and standard deviation σ
Standardizing with Z scores
The standardized score, or Z score, of an observation is the number of standard deviations it falls above or below the mean:
Z = (observation - mean)/SD

Z scores are defined for distributions of any shape, but only when the distribution is normal can we use Z scores to calculate
percentiles.
Observations that are more than 2SD away from the mean (|Z| > 2) are usually considered unusual.

Percentile
Percentile is the percentage of observations that fall below a given data point.
Graphically, percentile is the area below the probability distribution curve to the left of that observation.
>>One can use a Z table to find these percentiles.
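Python's standard library can replace the Z table lookup. A sketch with made-up exam scores, assumed to be nearly normal with mean 70 and SD 10:

from statistics import NormalDist

dist = NormalDist(mu=70, sigma=10)
x = 90
z = (x - dist.mean) / dist.stdev
print(z)            # 2.0 -> borderline unusual by the |Z| > 2 rule
print(dist.cdf(x))  # percentile: ~0.977, i.e. ~97.7% of scores fall below 90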

Six Sigma:
“The term six sigma process comes from the notion that if one has six standard deviations between the process mean and the
nearest specification limit, no items will fail to meet specifications.”

Normal Probability Plot: Anatomy of a normal probability plot


Data are plotted on the y-axis of a normal probability plot, and theoretical quantiles (following a normal distribution) on
the x-axis.
If there is a linear relationship in the plot, then the data follow a nearly normal distribution.
Constructing a normal probability plot requires calculating percentiles and corresponding z-scores for each observation,
which is tedious. Therefore we generally rely on software when making these plots.
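Software does make this a one-liner. A sketch using scipy and matplotlib (both assumed installed) on simulated, nearly normal data:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)  # simulated nearly normal sample

# Data on the y-axis, theoretical normal quantiles on the x-axis;
# a roughly straight line suggests the data are nearly normal.
stats.probplot(data, dist="norm", plot=plt)
plt.show()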

OS04: Foundations for inference


Variability in Estimates
“Margin of sampling error is plus or minus 2.9 percentage points for results based on the total sample and 4.4 percentage
points for adults ages 18-34 at the 95% confidence level.”
—41% +- 2.9%: We are 95% confident that 38.1% to 43.9% of the public believe young adults, rather than middle-aged or older adults, are having the toughest time in today's economy.
—49% +- 4.4%: We are 95% confident that 44.6% to 53.4% of 18-34 years olds have taken a job they didn’t want just to pay
the bills.

—We are often interested in population parameters.


—Since complete populations are difficult (or impossible) to collect data on, we use sample statistics as point estimates for
the unknown population parameters of interest.
—Sample statistics vary from sample to sample.
—Quantifying how sample statistics vary provides a way to estimate the margin of error associated with our point estimate.

Sampling Distribution
A sampling distribution is a probability distribution of a statistic (e.g. the sample mean x̄) obtained from a large number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of a population.

Central Limit Theorem


The distribution of the sample mean is well approximated by a normal model: x̄ ~ N(mean = μ, SE = σ/√n), where SE represents standard error, which is defined as the standard deviation of the sampling distribution. If σ is unknown, use s.

Certain conditions must be met for the CLT to apply:
1–Independence: Sampled observations must be independent. This is difficult to verify, but it is more likely if random sampling/assignment is used and, if sampling without replacement, n < 10% of the population.
2–Sample size/skew: Either the population distribution is normal, or if the population distribution is skewed, the sample size is large. The more skewed the population distribution, the larger the sample size we need for the CLT to apply; for moderately skewed distributions n > 30 is a widely used rule of thumb.
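A sketch of the CLT in action: sample means drawn from a strongly right-skewed population still pile up in a nearly normal way around the population mean, with spread close to σ/√n. The population below is simulated from an exponential distribution purely for illustration.

import random
import statistics

random.seed(3)
population = [random.expovariate(1.0) for _ in range(100_000)]  # right-skewed, mean ~1

n = 50
sample_means = [
    statistics.mean(random.sample(population, n)) for _ in range(2_000)
]

print(statistics.mean(sample_means))             # ~1.0, the population mean
print(statistics.stdev(sample_means))            # close to the SE below
print(statistics.pstdev(population) / n ** 0.5)  # SE = sigma/sqrt(n), ~0.14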

Confidence Intervals: A plausible range of values for the population parameter is called a confidence interval.
The approximate 95% confidence interval is defined as (point estimate) +- 2xSE where SE = s/(n)^1/2
Confidence interval, a general formula
point estimate +- z* x SE

—>Width of an interval: If we want to be more certain that we capture the population parameter, i.e. increase our confidence
level, we should use a wider interval.
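A minimal sketch of the general formula with made-up data, using z* = 1.96 for 95% confidence (the CLT conditions above are assumed to hold):

import random
import statistics

random.seed(5)
data = [random.gauss(100, 15) for _ in range(50)]  # made-up sample, n = 50

xbar = statistics.mean(data)                    # point estimate
se = statistics.stdev(data) / len(data) ** 0.5  # SE = s / sqrt(n)
z_star = 1.96                                   # critical value for 95% confidence

lo, hi = xbar - z_star * se, xbar + z_star * se
print(f"95% CI: ({lo:.1f}, {hi:.1f})")  # a plausible range for the population mean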
