Stats 201 Midterm Sheet

The document discusses statistical inference concepts including population parameters, sampling distributions, random sampling, standard error, confidence intervals, and the difference between quantitative and categorical variables. Random sampling aims to obtain a representative sample and reduce bias. The standard error quantifies variability in sampling distributions and is used to calculate confidence intervals.

Uploaded by

Lisbon Anderson
Copyright
© All Rights Reserved

1 Stat inference

When is a question inferential?
The target must be a population parameter that you would like to estimate, NOT a number that is already known. Statistical inference is the act of making a guess about a population using a sample.

The mean, proportion, median, variance, standard deviation, and correlation of a population are often estimated using sample data, and we write computer scripts to calculate estimates of these parameters:
    mean(), median(), var(), sd(), cor()

    library(dplyr)
    # samples: repeated samples with a replicate column
    # n: size of each sample (n = samplesSizesN)
    virtual_prop_interest <- samples %>%
      group_by(replicate) %>%
      summarize(Interest = sum(category == "Interest")) %>%
      mutate(propInterest = Interest / n)
    virtual_prop_interest

Population
A population is a collection of individuals or observations that we are interested in (a study population).

Sampling
Sampling is the act of collecting a sample from the population, which we generally only do when we can’t perform a census. We denote the sample size with lowercase n, as opposed to uppercase N, which denotes the population’s size. Typically the sample size n is much smaller than the population size N, so sampling is a much cheaper alternative to performing a census.

Population parameters
A population parameter is a numerical summary of interest about the population, such as the mean, median, or variance.

Estimate
A point estimate, also known as a sample statistic, is a summary statistic computed from a sample that estimates the unknown population parameter.

Explain random and representative sampling and how this can influence estimation.
Sampling methodology is the method used to collect samples. The way you collect samples directly influences their quality.
A sample is said to be representative if it roughly “looks like” the population. In other words, if the sample’s characteristics are a “good” representation of the population’s characteristics.
We say a sample is generalizable if any results based on the sample can generalize to the population. In other words, if we can make “good” guesses about the population using the sample.
We say a sampling procedure is biased if certain individuals in a population have a higher chance of being included in a sample than others.
We say a sampling procedure is unbiased if every individual in a population has an equal chance of being sampled.

Define random variables and explain how they relate to sampling.
Before we take the sample:
The elements of the sample are random
The sample proportion is random
The sample standard error is random
The boundaries of a confidence interval are random
The population parameter is constant

After we take the sample:
The elements of the sample are constant
The sample proportion is constant
The sample standard error is constant
The boundaries of a confidence interval are constant
The population parameter is constant
The elements of bootstrap samples are random
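
As a concrete illustration of the vocabulary above (population, sample, parameter, point estimate) and of what is random versus constant, here is a minimal base R sketch. The population of 2400 balls with 900 red ones is invented for illustration and does not come from the sheet; base R's sample() stands in for the tidyverse helpers used elsewhere.

```r
set.seed(1)

# Hypothetical population: 2400 balls, 900 of them red (made-up numbers)
population <- rep(c("red", "white"), times = c(900, 1500))
N <- length(population)               # population size
p <- mean(population == "red")        # population parameter: a constant

n <- 50                               # sample size, much smaller than N
sample_a <- sample(population, size = n)   # sampling without replacement
sample_b <- sample(population, size = n)   # a second, independent sample

# Point estimates: random before sampling, fixed numbers once a sample is drawn
p_hat_a <- mean(sample_a == "red")
p_hat_b <- mean(sample_b == "red")

cat("population parameter p:", p, "\n")
cat("point estimates:", p_hat_a, p_hat_b, "\n")
```

Rerunning with a different seed changes the two estimates but never changes p, which is the sense in which the parameter is constant while sample statistics are random.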
Sampling distribution
The distribution of a statistic, such as the sample mean, across repeated samples drawn from the population. Its spread is called the standard error, not the standard deviation.
Example: the collection of mean heights from repeated samples of 30 students, showing how the average height varies across different samples.

Estimator's sampling distribution
When estimating a parameter (e.g., mean or proportion) using a sample statistic (e.g., sample mean or sample proportion), the distribution of the statistic across all possible samples is called the estimator's sampling distribution.

Define standard error and explain its purpose.
The standard error is the standard deviation of the sampling distribution of a statistic (here, the sample mean). It measures how much the statistic varies from sample to sample.

Population distribution
The population distribution refers to the distribution of a particular variable in the entire population of interest. It describes the frequency or probability of each possible outcome within the entire group. Often, the entire population distribution is unknown or difficult to measure entirely.

Sample distribution
A sample distribution is the distribution of a specific variable within a single sample drawn from the population; it focuses on a single data set. It provides insights into the characteristics of the sample and helps make inferences about the population. Sample distributions may vary from one sample to another.
Example: the heights of 30 randomly selected students in a school.

How to draw random samples from a finite population (e.g., census data)
    rep_sample_n(reps = x, size = x, replace = TRUE/FALSE)
Use replace = FALSE when sampling from the population and replace = TRUE when resampling from the sample (bootstrap).

How to estimate a sampling distribution for a given statistic and population.
Define the population: clearly define the population of interest.
Identify the statistic
Determine the sample size
Choose the sampling procedure
Generate multiple samples
Calculate the statistic for each sample
Create a frequency distribution
Analyze the sampling distribution
Calculate confidence intervals
Assess precision and bias
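
The repeated-sampling steps for estimating a sampling distribution can be sketched in base R as follows. This is only a sketch under stated assumptions: the population of 10,000 heights is simulated rather than real data, and replicate() with sample() is used in place of rep_sample_n() so the example runs without extra packages.

```r
set.seed(42)

# Hypothetical finite population: 10,000 simulated heights in cm (not real data)
heights <- rnorm(10000, mean = 170, sd = 10)

reps <- 1000   # number of repeated samples
n <- 30        # size of each sample

# Draw `reps` samples without replacement and keep the mean of each one
sample_means <- replicate(reps, mean(sample(heights, size = n)))

# The SD of these sample means estimates the standard error
se_simulated <- sd(sample_means)

# Common approximation for comparison: population SD / sqrt(n)
se_formula <- sd(heights) / sqrt(n)

cat("mean of sample means:", mean(sample_means), "\n")
cat("population mean:     ", mean(heights), "\n")
cat("simulated SE:", se_simulated, " formula SE:", se_formula, "\n")
```

Calling hist(sample_means) afterwards shows the frequency distribution; raising n should narrow it, which is the link between sample size and standard error.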
2 Stat inference (cont.)

Compare and contrast quantitative and categorical variables.
Categorical variables take words or qualities as values, not numbers. Quantitative variables take numerical values.

Explain what a sampling distribution is, list its properties, and its purpose in statistical inference.
The sampling distribution is the distribution of a statistic (e.g., mean, variance, proportion) for all possible samples of a given size from a population.

Properties
Central Limit Theorem: Regardless of the shape of the population distribution, the sampling distribution of the sample mean approaches a normal distribution as the sample size increases.
Mean of Sampling Distribution: For an unbiased estimator, the mean of the sampling distribution is equal to the population parameter being estimated.
Standard Deviation of Sampling Distribution (Standard Error): The standard deviation of the sampling distribution is known as the standard error. It quantifies how much the sample statistic is expected to vary from the true population parameter.
Shape: For large sample sizes, the sampling distribution of the sample mean tends to be normal, even if the population distribution is not.
Spread: Larger sample sizes result in smaller standard errors and, therefore, a more precise estimate of the population parameter.

3 Bootstrapping

Define bootstrapping
Given a population with some population parameter, you could take samples over and over from it to get the sampling distribution of the statistic. When only one sample is available, bootstrapping instead resamples that sample with replacement, over and over. The bootstrap distribution approximates the sampling distribution, NOT the population.

Calc
    library(dplyr)
    library(ggplot2)
    library(infer)  # rep_sample_n(), specify(), generate(), calculate()

    # x is a placeholder; for a bootstrap, size equals the original sample size
    bootstrap <- sampling_dist |>
      rep_sample_n(reps = x, size = x, replace = TRUE)

    # Take 1000 virtual samples of size 50 from the bowl:
    virtual_samples <- bowl %>%
      rep_sample_n(size = 50, reps = 1000)
    # Compute the sampling distribution of 1000 values of p-hat
    sampling_distribution <- virtual_samples %>%
      group_by(replicate) %>%
      summarize(red = sum(color == "red")) %>%
      mutate(prop_red = red / 50)
    # Visualize sampling distribution of p-hat
    ggplot(sampling_distribution, aes(x = prop_red)) +
      geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
      labs(x = "Proportion of 50 balls that were red",
           title = "Sampling distribution")

Infer version
    bootstrap_distribution <- sample_1 %>%
      specify(response = parameterInterest, success = "typeInterest") %>%
      generate(reps = 1000, type = "bootstrap") %>%
      calculate(stat = "prop")

Why bootstrap
We usually have no access to the whole population, and the bootstrap can approximate the sampling distribution from a single sample.

Sampling vs bootstrap
While both sampling and bootstrap are techniques used in statistics, they differ in their approach, assumptions, and applications. Sampling focuses on drawing representative samples from populations, while bootstrap is a resampling technique used to estimate the sampling distribution of a statistic from observed data.

What is a confidence interval
A confidence interval is a range of values computed from a sample. It is constructed so that, over repeated sampling, a certain proportion of such intervals (the confidence level: 90%, 95%, or 99%) contain the population parameter.
If the bootstrap distribution is roughly normal, you can use the standard error to calculate the CI. Since the bootstrap distribution approximates the sampling distribution, its standard deviation estimates the standard error, and you can calculate the interval by hand:
$$\bar{x} \pm 1.96 \cdot SE = (\bar{x} - 1.96 \cdot SE,\ \bar{x} + 1.96 \cdot SE) = (1995.44 - 1.96 \cdot 2.15,\ 1995.44 + 1.96 \cdot 2.15) = (1991.15,\ 1999.73)$$
The multiplier 1.96 corresponds to a 95% confidence level. The endpoints are as given in the source; plugging in the rounded values shown gives about (1991.23, 1999.65), so they were presumably computed from an unrounded standard error.

What data do you use to get the CI
Bootstrap resamples drawn from the original sample.

How to calculate
    # Resample from the sample (bootstrap)
    bootstrap <- pennies_sample %>%
      rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>%
      group_by(replicate) %>%
      summarize(mean_year = mean(year))
    # Visualize the bootstrap distribution
    Plot <- bootstrap |>
      ggplot(aes(x = mean_year)) +
      geom_histogram() +
      labs()
    # Get the CI: for a 90% interval, take the 5th and 95th percentiles
    ci <- bootstrap |>
      summarize(ci_lower = quantile(mean_year, 0.05),
                ci_upper = quantile(mean_year, 0.95))
    # Graph the CI (population_mean must already be defined)
    ci_plot <- bootstrap %>%
      ggplot(aes(x = mean_year)) +
      geom_histogram(binwidth = 1) +
      # alpha keeps the histogram visible under the shaded interval
      annotate("rect", xmin = ci$ci_lower, xmax = ci$ci_upper,
               ymin = 0, ymax = Inf, alpha = 0.3) +
      geom_vline(xintercept = population_mean, size = 2, colour = "red") +
      labs(title = "Bootstrap distribution with 90% confidence interval",
           x = "Mean year")

Infer version
    # Bootstrap and visualize
    bootstrap <- pennies_sample |>
      specify(response = year) |>
      generate(reps = 1000, type = "bootstrap") |>
      calculate(stat = "mean")
    # For a graph: bootstrap |> visualize()
    # Get the CI
    percentile_ci <- bootstrap %>%
      get_confidence_interval(level = 0.90, type = "percentile")
    # Plot the CI
    Plot <- bootstrap |>
      visualize() +
      shade_confidence_interval(endpoints = percentile_ci)

What can you say about the results from the CI
The effectiveness of a confidence interval is judged by whether or not it contains the true value of the population parameter.

population -> parameters
sample -> point estimate -> estimates -> Parameter(1)
sample -> Estimator(2) -> standard error
bootstrap samples -> bootstrap distribution -> Estimates(3) -> sampling distribution
bootstrap distribution -> Standard deviation(4) -> estimates -> Standard error (5)
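
The bootstrap workflow summarized above (resample the sample, compute the statistic, then read off percentiles or use the standard deviation as a standard error) can be sketched in base R. Treat this as an illustration only: the 50 mint years are simulated as a stand-in for a real sample such as pennies_sample, and replicate() with sample() replaces the infer functions so the block runs without extra packages.

```r
set.seed(7)

# Hypothetical sample of 50 penny mint years (simulated, not the real data)
year <- sample(1960:2019, size = 50, replace = TRUE)

# 1000 bootstrap resamples: same size as the sample, drawn WITH replacement
boot_means <- replicate(
  1000,
  mean(sample(year, size = length(year), replace = TRUE))
)

# Percentile method: the middle 90% of the bootstrap distribution
percentile_ci <- quantile(boot_means, probs = c(0.05, 0.95))

# Standard error method: point estimate +/- z * SE, where the SE is the
# SD of the bootstrap distribution and z = qnorm(0.95) (about 1.645) for 90%
se_boot <- sd(boot_means)
se_ci <- mean(year) + c(-1, 1) * qnorm(0.95) * se_boot

print(percentile_ci)
print(se_ci)
```

The two intervals should be close when the bootstrap distribution is roughly bell-shaped. For a 95% interval, the percentile cutoffs become 0.025 and 0.975 and the multiplier becomes qnorm(0.975), about 1.96.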
