STAT100 - Full Course Notes

Chapter 1 and 2 - Introduction to data and Summarizing data

Understand the basic structure of a data set and the different types of variables.

Rows (Observations/Records): Each row corresponds to a single unit of analysis, such as a person,
product, or event. It is an instance of the data being studied.

Columns (Variables/Attributes/Features): Each column represents a characteristic or property of the observation. Variables can be of different types and hold different forms of data.

Categorical (Qualitative) Variables:

Nominal: These variables represent categories without any inherent order. For example, gender (male,
female), religion, or product types.

Ordinal: These variables represent categories with a meaningful order or ranking, but the differences
between ranks are not necessarily equal. For example, customer satisfaction (low, medium, high) or
education level (high school, bachelor's, master's).

Numerical (Quantitative) Variables:

Discrete: These are countable variables that take specific values, often integers. For example, number of
children, number of cars owned.

Continuous: These variables can take any value within a range and are often measured. Examples
include height, weight, and temperature.

Understand the statistical concepts behind unbiased and random data collection.

Unbiased data collection - In statistics, bias refers to systematic errors or deviations that cause the data to misrepresent the true characteristics of the population. Common sources of bias include:

Selection Bias: Occurs when the sample does not accurately represent the population, such as when
certain subgroups are underrepresented or overrepresented.

Measurement Bias: Happens when the method of data collection systematically distorts the data, such
as poorly calibrated instruments or leading survey questions.

Non-response Bias: Results from a significant portion of the sample not participating, potentially leading
to unrepresentative results.

Random Data Collection - Random data collection refers to the process of selecting individuals or
units from a population in such a way that each one has an equal and independent chance of being
included in the sample. Key Random Sampling Techniques:
Simple Random Sampling (SRS): Each member of the population has an equal chance of being selected.
This can be achieved using random number generators or lottery methods.

Stratified Sampling: The population is divided into subgroups (strata) based on certain characteristics,
and random samples are drawn from each subgroup. This ensures representation from all relevant
subgroups.

Cluster Sampling: The population is divided into clusters, and entire clusters are randomly selected for
the sample. This is useful when the population is large or geographically dispersed.

Systematic Sampling: Every nth member of the population is selected after randomly choosing a starting
point. It is easier to implement but requires caution to avoid periodic patterns in the population.
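
As an illustration, simple random and systematic sampling can be carried out in R with the sample() function (the population vector below is hypothetical):

# hypothetical population of 1000 unit IDs
population <- 1:1000

# simple random sample of 50 units, without replacement
srs <- sample(population, size = 50)

# systematic sample: random start in 1..20, then every 20th unit
start <- sample(1:20, 1)
systematic <- population[seq(from = start, to = 1000, by = 20)]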

Be able to differentiate between observational studies and experiments.

Observation studies - An observational study involves collecting data on subjects without manipulating
the environment or the variables of interest.

Experiments - An experiment is a controlled study where the researcher deliberately manipulates one or
more variables (called independent variables) to observe the effect on another variable (called the
dependent variable), often with the goal of establishing causality.

Be familiar with how to display and calculate basic summary statistics from numerical
data.

Mean - the sum of all data points divided by the number of data points.

Median - Is the middle value of the data when sorted. If there’s an odd number of values, it’s the middle
one. If there’s an even number, it’s the average of the two middle values.

Mode - Is the value(s) that appear most frequently in the dataset. A dataset may have one mode, more
than one mode, or no mode at all.

Range - Is the difference between the largest and smallest values in the dataset.
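
A short R sketch computing these summaries on a hypothetical data vector (note that R's built-in mode() function reports a storage type, not the statistical mode, so the mode is found via table()):

x <- c(2, 4, 4, 5, 7, 9)
mean(x)    # 5.17 (31 / 6)
median(x)  # 4.5 (average of the two middle values, 4 and 5)
names(which.max(table(x)))  # "4", the most frequent value
max(x) - min(x)             # 7, the range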

Chapter 3 - Probability
Addition Rule: P(A or B) = P(A) + P(B) − P(A and B).

Example: If P(A) = 0.5, P(B) = 0.4 and P(A and B) = 0.2, then P(A or B) = 0.5 + 0.4 − 0.2 = 0.7.

Multiplication Rule for Probability: P(A and B) = P(A) × P(B | A). For independent events this reduces to P(A and B) = P(A) × P(B).

Marginal and Joint Probabilities (Disjoint Events): A marginal probability concerns a single event on its own, e.g. P(A); a joint probability concerns two events occurring together, P(A and B). For disjoint (mutually exclusive) events, P(A and B) = 0, so the Addition Rule simplifies to P(A or B) = P(A) + P(B).

Conditional Probability: P(A | B) = P(A and B) / P(B), provided P(B) > 0.

Chapter 4 - Distribution of Random Variables

Describe a normal distribution using correct notation

A normal distribution is a probability distribution with a bell-shaped curve, symmetric about its mean (μ)
and characterized by its mean and standard deviation (σ). The notation is: X∼N(μ,σ^2)

Calculate and interpret z-scores for given values from a normal distribution.

The z-score of a value x is z = (x − μ)/σ; it measures how many standard deviations x lies above (positive z) or below (negative z) the mean.

Calculate probabilities (percentiles) for a normal distribution using R

To calculate probabilities (i.e., the area under the curve) for a normal distribution, the pnorm() function
is used in R.
Example: The probability of a z-score less than 1.96 is:
pnorm(1.96)

This gives approximately 0.975, meaning 97.5% of values lie below a z-score of 1.96.

Calculate cut-off points (quantiles) for a normal distribution using R

To find quantiles (e.g., the 95th percentile), you can use the qnorm() function in R.

Example: The 95th percentile of a normal distribution with mean 0 and standard deviation 1 is:

qnorm(0.95)

This gives approximately 1.645, meaning that 95% of the data lies below this z-score.

Understand and apply the 68-95-99.7 (Empirical) Rule

This rule states that for a normal distribution:

68% of the data falls within 1 standard deviation of the mean (μ±σ).

95% falls within 2 standard deviations (μ±2σ).

99.7% falls within 3 standard deviations (μ±3σ).

Example: If test scores are normally distributed with a mean of 75 and a standard deviation of 5, 95% of
students scored between 65 and 85.
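
The rule can be verified numerically in R using pnorm():

pnorm(1) - pnorm(-1)  # approximately 0.683 (within 1 SD)
pnorm(2) - pnorm(-2)  # approximately 0.954 (within 2 SD)
pnorm(3) - pnorm(-3)  # approximately 0.997 (within 3 SD)
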
Evaluate the normal approximation using a normal probability plot.

A normal probability plot (normal Q-Q plot), together with histograms and boxplots, is used to assess whether data follows a normal distribution. If the data is normal, the points on the Q-Q plot will lie along a straight line. Deviations from this line indicate departures from normality.

Example: In R, you can create a normal probability plot with:

qqnorm(data)

qqline(data)

List the necessary conditions for a binomial experiment

A binomial experiment must meet the following criteria:

- Fixed number of trials (n).

- Each trial has two possible outcomes: success or failure.

- The probability of success p is constant for each trial.

- The trials are independent.

Example: Tossing a fair coin 10 times is a binomial experiment, where each toss has a probability of success (heads) p = 0.5.

Calculate the mean, variance and standard deviation of a binomial random variable.

For X ∼ Binomial(n, p): mean μ = np, variance σ^2 = np(1 − p), and standard deviation σ = √(np(1 − p)). For example, in 10 flips of a fair coin, μ = 10 × 0.5 = 5 heads.

Calculate probabilities for a binomial distribution using R.

The dbinom() function calculates the probability of a given number of successes in a binomial
experiment.

Example: To calculate the probability of getting exactly 4 heads in 10 flips of a fair coin:

dbinom(4, size = 10, prob = 0.5)

This returns approximately 0.205, meaning a 20.5% chance of exactly 4 heads.
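
For cumulative probabilities, pbinom() gives P(X ≤ x). For the same coin example:

pbinom(4, size = 10, prob = 0.5)      # ~0.377, probability of at most 4 heads
1 - pbinom(4, size = 10, prob = 0.5)  # ~0.623, probability of 5 or more heads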

Each concept links to key principles of statistics and probability, especially useful in hypothesis testing,
data analysis, and decision-making in uncertain conditions.

Chapter 5 - Foundations of Inference

Describe the difference between descriptive statistics and inferential statistics.

Descriptive Statistics: These are methods for summarizing and organizing data. Descriptive statistics
provide a way to present data in a meaningful way. Common measures include mean, median, mode,
variance, and standard deviation. Example: If you have a dataset of the test scores of a class of students,
you could calculate the average score (mean), the highest score (maximum), and the lowest score
(minimum) to describe the performance of the class.

Inferential Statistics: These involve making predictions or inferences about a population based on a
sample of data. Inferential statistics often use probability theory to make conclusions that extend beyond
the data at hand. Example: If you want to estimate the average height of all adults in a city, you might
take a sample of 100 adults and calculate the sample mean, then use that mean to infer the average
height of the entire adult population.

Differentiate between population parameters and sample statistics using appropriate notation.

Population Parameters: These are numerical characteristics of a population. They are typically denoted
by Greek letters. Example Notation: The population mean is denoted as μ, and the population standard
deviation is denoted as σ.

Sample Statistics: These are numerical characteristics of a sample. They are typically denoted by Latin
letters. Example Notation: The sample mean is denoted as x̄, and the sample standard deviation is
denoted as s.

Describe the sampling distribution of a sample mean.

The sampling distribution of the sample mean is the probability distribution of all possible sample
means from a population. As sample sizes increase, the sampling distribution of the sample mean
approaches a normal distribution, regardless of the shape of the population distribution, provided the
sample size is sufficiently large. Some Key Characteristics:

- The mean of the sampling distribution is equal to the population mean (μ).

- The standard deviation of the sampling distribution (standard error) is σ/√n, where σ is the population standard deviation and n is the sample size.

State, understand and apply the Central Limit Theorem.

The Central Limit Theorem states that the sampling distribution of the sample mean will approach a normal distribution as the sample size n becomes large (usually n ≥ 30), regardless of the shape of the population distribution.

- Application: This theorem is fundamental in inferential statistics as it allows for the use of normal
probability methods to make inferences about the population mean based on sample data.

Calculate and interpret an approximate 95% Confidence Interval for a population mean.

For large samples, an approximate 95% confidence interval for the population mean is x̄ ± 1.96 × s/√n, where x̄ is the sample mean, s is the sample standard deviation and n is the sample size.

Conduct a test of hypothesis about a population mean based on large samples.
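
A minimal R sketch of both tasks, using a hypothetical vector of test scores (t.test() reports the test statistic, the p-value and a 95% confidence interval in one call):

scores <- c(72, 85, 90, 68, 77, 81, 95, 70, 88, 79)
t.test(scores, mu = 75)  # H0: μ = 75; output includes the 95% CI for μ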
Understand decision errors and explain the difference between Type I and Type II
errors.

Type I Error (α): Occurs when the null hypothesis is true but is incorrectly rejected. This is often
considered a "false positive." Example: Concluding a new drug is effective when it is not.

Type II Error (β): Occurs when the null hypothesis is false but is incorrectly accepted. This is often
considered a "false negative." Example: Concluding a new drug is ineffective when it is actually effective.

Explain the difference between practical significance and statistical significance.

Statistical Significance: A result is statistically significant if it is unlikely to have occurred by random chance, as determined by a hypothesis test (typically, a p-value < 0.05). Example: Finding a p-value of 0.03 means there is strong evidence against the null hypothesis.

Practical Significance: This refers to the real-world relevance or importance of a statistically significant
result. A result can be statistically significant but not practically significant if the effect size is small.
Example: A new medication might reduce symptoms by a statistically significant amount, but if the
reduction is negligible in everyday terms, it may not be considered practically significant.

Understand the necessary conditions and assumptions for undertaking a χ² test for goodness-of-fit or a test of independence in two-way tables.

- Chi-squared test for goodness-of-fit: This test determines if a sample distribution matches an expected
distribution.

- Chi-squared test of independence: This test assesses whether two categorical variables are
independent of each other.

Assumptions:

Random Sampling: Data must come from a random sample of the population.

Expected Frequency: Each expected frequency should be at least 5. If any expected frequency is less
than 5, the results may not be valid.

Categorical Data: The data should be in categorical form (e.g., counts of individuals in categories).

Example:

Goodness-of-Fit: If you have a die and want to test if it is fair, you would roll it a certain number of times,
count the outcomes for each face, and compare these counts to the expected counts (which should be
equal for a fair die).

Independence: If you want to test whether gender is related to favorite color, you would collect counts
of males and females who prefer different colors and perform a χ² test of independence.

Pearson's Chi-Squared Test is a statistical method used to determine whether there is a significant
association between two categorical variables. It compares the observed frequencies in a contingency
table to the expected frequencies under the assumption that there is no association (i.e., under the null
hypothesis).

Chi-squared statistic (χ²): A measure of the difference between the observed and expected frequencies, calculated as χ² = Σ (O − E)² / E over all categories or cells.

p-value: The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. If the p-value is less than a significance level (commonly 0.05), it suggests that there is enough evidence to reject the null hypothesis.

- The p-value (p = 0.074, χ² = 6.923) is greater than 0.05, so there isn't a significant association between the number of children an individual has and whether they own a pet.

- The null hypothesis should not be rejected.

Distinguish and calculate observed and expected values when given count data for one or two categorical variables.

Observed Values: The actual counts collected from your data.

Expected Values: The counts you would expect if the null hypothesis is true.

Calculating Expected Values:

For a goodness-of-fit test, the expected value for each category is calculated based on the proportion of the total sample size.
For independence, the expected count for each cell in a contingency table is calculated using:

Expected count = (row total × column total) / grand total

Example:

Observed Values: You roll a die 60 times and get: 10 ones, 15 twos, 12 threes, 8 fours, 7 fives, and 8 sixes.

Expected Values: For a fair die, you expect 10 for each side (60 rolls / 6 sides).
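
This comparison can be run as a χ² goodness-of-fit test in R; chisq.test() assumes equal expected proportions by default:

observed <- c(10, 15, 12, 8, 7, 8)
chisq.test(observed)  # χ² = 4.6, df = 5, p ≈ 0.47: no evidence the die is unfair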

Calculate the degrees of freedom for a χ2 test of independence in two-way tables.

Degrees of Freedom (df): The number of values in the final calculation of a statistic that are free to vary.

Calculating Degrees of Freedom:

For a test of independence in a two-way table, the degrees of freedom are calculated as:

df=(r−1)×(c−1)

where r is the number of rows and c is the number of columns in the contingency table.

Example:

If you have a 3x4 table (3 rows and 4 columns), the degrees of freedom would be:

df=(3−1)(4−1)=2×3=6

Applying df = (r − 1) × (c − 1) to a 2×4 table:

df = (2 − 1) × (4 − 1) = 1 × 3 = 3

Use R to calculate χ2 test statistics and standardized residuals.

Test Statistic: A measure used to evaluate the null hypothesis.

Standardized Residuals: Measures how far the observed counts deviate from the expected counts, scaled
by the standard deviation of the expected counts.
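
A minimal sketch with a hypothetical 2×4 table of counts (e.g., pet ownership by number of children):

counts <- matrix(c(20, 15, 10, 5,
                   18, 22, 14, 16),
                 nrow = 2, byrow = TRUE)
test <- chisq.test(counts)
test$statistic  # the χ² test statistic
test$p.value
test$expected   # expected counts under independence
test$stdres     # standardized residuals for each cell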
Explain the results of an R analysis and write an informative conclusion.

Null Hypothesis: The hypothesis that there is no effect or relationship.

Alternative Hypothesis: The hypothesis that there is an effect or relationship.

Example:

After running a χ² test, you get a p-value. If the p-value is less than your significance level (commonly
0.05), you reject the null hypothesis.

Example Conclusion:

“The χ² test for independence indicated a significant association between gender and favorite color, χ²(6,
N = 100) = 15.24, p < 0.01. This suggests that color preference is not independent of gender, as certain
colors are preferred by different genders in this sample.”

The null hypothesis (H0) is a statement that there is no effect or no association between the variables being studied. In hypothesis testing, we aim to either reject or fail to reject the null hypothesis based on the data. The null hypothesis should not be rejected if the p-value is > 0.05.

Example:
- There is no association between the number of children and if an individual owns a pet.

This means that under the null hypothesis, pet ownership and having children are assumed to be
independent of each other. The goal of the study is to test if this assumption can be rejected based on
the data provided in the contingency table.

Chapter 7 - Inference for numerical data

distinguish between independent and paired (dependent) sampling scenarios.

Independent Sampling: Samples are collected independently from two or more groups. The selection of
one sample does not affect the selection of another. Example: Comparing the test scores of two different
classes (Class A and Class B).

Paired Sampling: Samples are related or matched in some way. Each subject in one sample corresponds
to a specific subject in the other sample. Example: Measuring the weight of subjects before and after a
diet program.

calculate a confidence interval for a single population mean, mean difference for
paired samples, and the difference in population means using the appropriate formula,
and using R.

Single Population Mean: x̄ ± t* × s/√n (using the t distribution with n − 1 degrees of freedom)

Mean Difference for Paired Samples: d̄ ± t* × s_d/√n, where d̄ and s_d are the mean and standard deviation of the within-pair differences

Difference in Population Means: (x̄1 − x̄2) ± t* × √(s1^2/n1 + s2^2/n2)
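
In R, all three intervals come from t.test(); a sketch with hypothetical data vectors:

x <- c(5.1, 4.8, 5.6, 5.0, 4.7)
y <- c(4.2, 4.9, 4.4, 4.6, 4.1)
before <- c(80, 75, 90, 85, 78)
after  <- c(82, 79, 93, 88, 80)

t.test(x)$conf.int                             # CI for a single mean
t.test(after, before, paired = TRUE)$conf.int  # CI for the paired mean difference
t.test(x, y)$conf.int                          # CI for the difference in two means (Welch)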

state and check relevant conditions necessary for calculating a confidence interval for
a single population mean, mean difference for paired samples, and the difference in
two population means.
Single Population Mean:

Normality: The data should be approximately normally distributed (especially for small sample sizes).

Independence: Samples must be independent.

Paired Samples:

Normality of Differences: The distribution of the differences should be approximately normal.

Paired Design: Samples should be paired correctly.

Difference in Two Means:

Normality: Each sample should be approximately normally distributed.

Independence: Samples should be independent.

give a practical interpretation of a confidence interval, relating to the context of the problem.

A confidence interval gives a range of values within which we can be confident the true population
parameter lies. For example, if the 95% CI for a mean difference is (2.5, 5.5), we can say we are 95%
confident that the true mean difference lies between 2.5 and 5.5.

apply the five steps of hypothesis testing for one and two means.

1. State the Hypotheses:

Null Hypothesis H0: There is no effect or difference.

Alternative Hypothesis Ha: There is an effect or difference.

2. Set the Significance Level (α): Commonly set at 0.05.

3. Calculate the Test Statistic: Using the appropriate formula depending on the test.

4. Determine the P-value: The probability of observing the data if the null hypothesis is true.

5. Make a Decision: Reject or fail to reject H0 based on the p-value and significance level.

find the test statistic using the appropriate formula, and using R.

Two Means: t = (x̄1 − x̄2) / √(s1^2/n1 + s2^2/n2)

For a single mean, the test statistic is t = (x̄ − μ0) / (s/√n). In R, t.test() reports the t statistic, its degrees of freedom and the p-value directly.

understand and interpret p-values.

A p-value indicates the strength of evidence against the null hypothesis. A low p-value (typically < 0.05)
suggests that we reject the null hypothesis, while a high p-value suggests we do not have enough
evidence to reject it.

understand and describe the relationship between confidence intervals and hypothesis testing.

A confidence interval that does not include the null hypothesis value suggests we can reject the null
hypothesis at the corresponding significance level. Conversely, if the null hypothesis value falls within the
confidence interval, we fail to reject the null hypothesis.

conduct an F-test for the difference in more than two means using analysis of variance (ANOVA).

ANOVA: Used to compare means among three or more groups.

Example Hypotheses:

H0: All group means are equal.

Ha: At least one group mean is different.

state and check the conditions necessary to conduct the F-test.

Independence: Samples must be independent.

Normality: Each group should be approximately normally distributed.

Homogeneity of variances: The variances among the groups should be roughly equal.

complete an ANOVA table by hand, i.e., given relevant information (including number of groups, sample sizes and Sums of Squares (SS)), determine the degrees of freedom, calculate Mean Squares (MS = SS/df) and the F-statistic (F = MS between groups / MS within groups).
use R to calculate individual group means.
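
A minimal sketch, assuming a data frame df with a numeric response y and a factor group (names and data here are hypothetical):

set.seed(1)
df <- data.frame(y = c(rnorm(10, 5), rnorm(10, 6), rnorm(10, 5.5)),
                 group = factor(rep(c("A", "B", "C"), each = 10)))

tapply(df$y, df$group, mean)  # individual group means
fit <- aov(y ~ group, data = df)
summary(fit)  # ANOVA table: Df, Sum Sq, Mean Sq (= SS/df), F value, Pr(>F)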

explain the results of an R analysis and write an informative conclusion.

Present the findings of your statistical tests clearly, including test statistics, p-values, confidence
intervals, and conclusions. For example, "The ANOVA results indicated significant differences in means
across groups (F(2, 27) = 5.43, p < 0.05), suggesting that at least one group's mean differs from the
others."

Chapter 8 - Introduction to linear regression

state the simple linear regression model.

The model is y = b0 + b1x + ε, where:

y: Dependent variable (response variable)

x: Independent variable (predictor variable)

b0: Intercept (the expected value of y when x = 0)

b1: Slope (the change in y for a one-unit change in x)

ε: Random error term (captures the variation in y not explained by x)

identify and interpret the estimates of the intercept (b0), slope (b1), squared correlation (R²) and standard error of the slope (SEb1); t-test statistic; F-statistic; and relevant p-values, from a summary output of the regression model.

When you perform a regression analysis, you'll typically receive a summary output containing several key
estimates. Here’s how to interpret them:

Intercept (b0): The value of y when x is zero. It indicates the starting point of the regression line on the y-
axis.

Slope (b1): This indicates how much y is expected to change for each unit increase in x. If b1 is positive, it
suggests a direct relationship; if negative, an inverse relationship.

Squared Correlation (R²): Represents the proportion of variance in y that can be explained by x. Values
range from 0 to 1. A value closer to 1 indicates a good fit.

Standard Error of the Slope (SEb1): Measures how much the estimated slope would vary from sample to sample. A smaller standard error indicates a more precise estimate of the slope.

t-test Statistic: Used to determine if the slope is significantly different from zero. Calculated as t = b1 / SEb1.
F-statistic: Used to determine the overall significance of the regression model. It tests whether at least one predictor variable has a non-zero coefficient: F = (variance explained by the model) / (unexplained variance). In simple linear regression, the F-statistic equals the square of the slope's t-statistic.

P-values: Indicate the probability of observing a test statistic at least as extreme as the one obtained, under the null hypothesis. A p-value less than a chosen significance level (e.g., 0.05) suggests rejecting the null hypothesis.

verify the value of the t-statistic using the R output.

verify the t-statistic from the summary of your linear regression model:

In the output, you will find the estimate for the slope (b1) and its corresponding standard error (SEb1). The t-statistic is calculated as t = b1 / SEb1, and should match the "t value" column of the coefficients table.
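
A sketch with hypothetical x and y vectors:

set.seed(1)
x <- 1:20
y <- 3 + 2.5 * x + rnorm(20)

fit <- lm(y ~ x)
summary(fit)  # coefficients table: Estimate, Std. Error, t value, Pr(>|t|)

# verify the slope's t value by hand: t = b1 / SEb1
coefs <- coef(summary(fit))
coefs["x", "Estimate"] / coefs["x", "Std. Error"]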

test the hypothesis that there is no linear association between x and y (i.e., test H0: β1 = 0).

To test the hypothesis H0 : β1 = 0 (no linear association between x and y), you can use a t-test:

Null Hypothesis: H0 : β1 = 0

Alternative Hypothesis: Ha: β1 ≠ 0

If the p-value associated with the t-statistic is less than 0.05, you reject the null hypothesis, indicating a
significant linear relationship between x and y.
check the necessary conditions for simple linear regression.

Before concluding from the regression results, check these assumptions:

Linearity: The relationship between x and y should be linear. You can visualize this with a scatter plot.

Independence: Observations should be independent of each other. This is generally ensured by proper
experimental design.

Homoscedasticity: The residuals (errors) should have constant variance at all levels of x. Plot residuals
against fitted values to check. Constant variance means that the spread or dispersion of the residuals
does not increase or decrease as the values of the independent variable change.

Normality: The residuals should be approximately normally distributed. This can be checked using a Q-Q
plot or histogram of residuals.
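
These checks can be done visually in R; a sketch assuming a fitted model fit (e.g., fit <- lm(y ~ x) from the earlier example):

plot(fitted(fit), resid(fit))  # residuals vs fitted: look for constant spread, no pattern
abline(h = 0)
qqnorm(resid(fit))  # normality of residuals
qqline(resid(fit))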

calculate, using the relevant formula (b1 ± t* × SEb1, with t* from the t distribution with n − 2 degrees of freedom) and using R, a confidence interval for the slope of the least squares regression line.

To calculate a confidence interval for the slope using R:
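
Using the fitted model object (fit, from the earlier sketch), confint() gives the interval:

confint(fit, level = 0.95)       # intervals for both intercept and slope
confint(fit, "x", level = 0.95)  # slope only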

The confidence interval provides a range within which we expect the true slope to lie with a certain level
of confidence (e.g., 95%).

give an informative interpretation of the regression analysis, relating it to the context of the question.

In context: If you found a significant positive slope (e.g., b1 = 2.5, p < 0.05), you could say, "For every one-unit increase in x, y increases by 2.5 units on average, indicating a strong positive linear relationship."

Discuss the practical implications of this relationship. For example, if x represents study hours and y represents test scores, you might conclude that more study time is associated with better performance.

Dependent Variable (Response Variable): The outcome variable you are trying to predict. Example: Test
scores based on study hours. (y)

Independent Variable (Predictor Variable): The variable you are manipulating or observing for its effect
on the dependent variable. Example: Hours spent studying. (x)

Intercept (b0): The expected value of the dependent variable when the independent variable is zero.
Example: The expected test score for a student who studies 0 hours.

Slope (b1): Indicates how much the dependent variable changes with a one-unit change in the
independent variable. Example: A slope of 2 means for each additional hour of study, the test score
increases by 2 points.

R² (Squared Correlation): Indicates the proportion of variance in the dependent variable that can be
explained by the independent variable. Example: An R² of 0.80 means 80% of the variance in test scores
is explained by study hours.

Standard Error (SE): Indicates the accuracy of the slope estimate. Example: A small standard error
suggests a precise estimate of how study hours affect scores.

t-test: A statistical test used to determine if a coefficient (like slope) is significantly different from zero.
Example: A t-statistic of 4.5 suggests that the slope is significantly different from zero.

F-statistic: Tests the overall significance of the regression model. Example: An F-statistic of 15 suggests
that at least one predictor significantly affects the dependent variable.

P-value: The probability of obtaining results at least as extreme as those observed if the null hypothesis were true. Example: A p-value of 0.01 indicates strong evidence against the null hypothesis.
Plots/graphs

Example figures: histogram, residual plot, boxplot, normal Q-Q plot, scatter plot.