
Statistics

Weronika Przysiężna, Mikołaj Straszewicz, Piotr Wilkowski

26.03.2025

1 / 28
Population vs sample

Population is the whole group that is of interest to us (e.g., all trees in a forest, all inhabitants of a country).
Sample is a subset of the population that is meant to represent the whole set. The size of the sample is crucial for the accuracy and reliability of further research (e.g., 50 trees from a forest, 1000 citizens of a country).

2 / 28
Representativeness of the Group

Representativeness of a sample means that its features match the characteristics of the population from which the sample was drawn.
For example, if we know that the mean height of trees in a forest is 35 meters, we can expect that the mean height of trees in the sample will be approximately 35 meters.
Properties of representative samples (see the sketch below):
randomness - every unit in the population has an equal probability of being included in the sample,
quantity - larger sample sizes lead to more accurate and reliable results.
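
A minimal sketch of random sampling, assuming NumPy is available (the tree heights, seed, and sizes below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: heights (in meters) of 10,000 trees in a forest.
population = rng.normal(loc=35.0, scale=5.0, size=10_000)

# Simple random sample: every tree has the same chance of being included.
sample = rng.choice(population, size=50, replace=False)

print("Population mean height:", round(population.mean(), 2))  # close to 35
print("Sample mean height:    ", round(sample.mean(), 2))      # approximately 35
```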

3 / 28
Estimators

A parameter represents a numerical characteristic of a population, such as the mean, median, variance, or standard deviation, and it remains constant for that population unless the population itself changes.
An estimator is a function used to approximate the value of a parameter from a sample. A good estimator is (see the sketch below):
Consistent - increasing the sample size increases the probability that the estimator is close to the population parameter,
Efficient - it has the smallest possible error (variance) among comparable estimators,
Unbiased - the expected value of the estimator equals the true parameter.
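
A rough illustration of consistency, assuming NumPy and a simulated population (all values are hypothetical): the sample mean tends to move closer to the population mean as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.normal(loc=35.0, scale=5.0, size=100_000)

# The sample mean is a consistent (and unbiased) estimator of the population mean.
for n in (10, 100, 1_000, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    print(f"n = {n:>6}: sample mean = {sample.mean():.3f} "
          f"(population mean = {population.mean():.3f})")
```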

4 / 28
Measurement error

A measurement error is the difference between an approximation and the true value.
Standard error is a theoretical quantity that describes how much the estimate of a parameter would vary if the research were repeated multiple times. In other words, this value reflects the variation in a measurement due to differences between samples (see the sketch below).
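
One way to see the standard error concretely is to simulate many repeated samples and compare the spread of their means with the formula σ/√n; a minimal sketch (assuming NumPy, with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(seed=2)
population = rng.normal(loc=35.0, scale=5.0, size=100_000)

n = 50
# Repeat the "study" many times and record the sample mean each time.
sample_means = [rng.choice(population, size=n).mean() for _ in range(2_000)]

empirical_se = np.std(sample_means)               # variation between samples
theoretical_se = population.std() / np.sqrt(n)    # sigma / sqrt(n)

print("Empirical standard error:  ", round(empirical_se, 3))
print("Theoretical standard error:", round(theoretical_se, 3))
```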

5 / 28
Optimal sample size
The optimal sample size is typically 30, as this number is both easy to collect and sufficiently large to ensure accuracy and reliability.
However, the rule of 30 does not always apply, as it depends on the specific characteristics of the population being studied.

Figure: Standard error and sample size relation


6 / 28
Histogram
A simple, graphical way to present research results is to show them in a histogram. This method allows us to identify outcomes that significantly deviate from the others.

Figure: Exemplary histogram
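
A minimal histogram sketch, assuming NumPy and Matplotlib are available (the data are simulated):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=3)
heights = rng.normal(loc=35.0, scale=5.0, size=500)  # hypothetical tree heights

plt.hist(heights, bins=20, edgecolor="black")
plt.xlabel("Tree height [m]")
plt.ylabel("Count")
plt.title("Exemplary histogram")
plt.show()
```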

7 / 28
Normal distribution

A normal distribution or Gaussian distribution is a type of


continuous probability distribution for a real-valued random variable.
The general form of its probability density function is
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, where \mu is the mean and \sigma is the standard deviation.
The normal distribution is easy to analyze due to its mathematical simplicity and predictable properties, making it an extremely practical tool in statistics and data analysis.
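
The density formula can be checked directly; a small sketch (assuming SciPy is available for comparison):

```python
import math
from scipy.stats import norm

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution, written out from the formula above."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(normal_pdf(1.0))                # ~0.2420 for the standard normal
print(norm.pdf(1.0, loc=0, scale=1))  # SciPy gives the same value
```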

8 / 28
The Central Limit Theorem
The Central Limit Theorem states: regardless of the shape of the population distribution, when random and independent measurements are taken from it, the distribution of sample means approaches a normal distribution, and the more observations we collect, the closer it gets to normality.

Figure: Sampling distribution and sample size
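
A quick simulation of the theorem, assuming NumPy and SciPy (the exponential population is just an example of a non-normal distribution):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=4)
population = rng.exponential(scale=2.0, size=100_000)  # clearly non-normal (skewed)

# As the sample size n grows, the distribution of sample means loses its skew
# and approaches a symmetric, bell-shaped (normal) distribution.
for n in (2, 10, 50, 200):
    sample_means = rng.choice(population, size=(5_000, n)).mean(axis=1)
    print(f"n = {n:>3}: skewness of the sample means = {skew(sample_means):.2f}")
```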


9 / 28
Interval estimation

Standardizing the normal probability distribution makes it possible to calculate probabilities for specific values and to accurately estimate the level of "trust," known in statistics as the confidence level (typically set at 95%).
Based on the chosen confidence level (e.g., 95%), a specific range within the standard normal distribution is called the confidence interval. In a normal distribution, 95% of the values fall within ±1.96 standard deviations of the mean (see the sketch below).
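
A minimal 95% confidence interval for a mean, assuming NumPy and SciPy (the sample itself is simulated):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=5)
sample = rng.normal(loc=35.0, scale=5.0, size=50)  # hypothetical measurements

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))     # standard error of the mean
z = norm.ppf(0.975)                                # ~1.96 for a 95% confidence level

print(f"95% CI: [{mean - z * se:.2f}, {mean + z * se:.2f}]")
```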

10 / 28
Null Hypothesis Significance Testing (NHST)

Null Hypothesis Significance Testing


In this approach, verification is not about determining the probability that the alternative hypothesis (the one claiming that an effect exists) is true. Instead, it is about rejecting the null hypothesis, which states that there is no effect.

research hypotheses often assume very specific effects (directional


hypotheses),
statistical hypotheses (typically for two-tailed tests) only describe
general relationships (non-directional hypotheses).

11 / 28
Null and Alternative Hypotheses

Null Hypothesis (H0 ): No effect or difference.


Alternative Hypothesis (HA ): There is an effect or difference.

H0 : µ1 = µ2
HA : µ1 ≠ µ2

The Greek symbols used here (µ1 , µ2 ) indicate that we are referring to
population means, not sample means.
According to the NHST approach, the null hypothesis is superior to the
alternative hypothesis because it is the one that is actually tested.
Accepting (more precisely: having no grounds to reject) or
rejecting the null hypothesis is not evidence of the non-existence
or existence of a particular effect or relationship!
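
A minimal two-tailed test of H0 : µ1 = µ2, assuming NumPy and SciPy (the two groups are simulated):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=6)

# Two hypothetical groups; H0: mu1 = mu2, HA: mu1 != mu2 (two-tailed test).
group1 = rng.normal(loc=35.0, scale=5.0, size=40)
group2 = rng.normal(loc=37.0, scale=5.0, size=40)

t_stat, p_value = ttest_ind(group1, group2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "No grounds to reject H0")
```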

12 / 28
Type I and Type II Errors
Type I Error (α): Rejecting H0 when it is true (False Positive).
Type II Error (β): Failing to reject H0 when HA is true (False Negative).

13 / 28
Significance Level (α) and Confidence Level
Significance Level (α): Probability of a Type I error, commonly
set at 5% (sometimes 1% or 10%).

14 / 28
Significance Level (α) and Confidence Level

Confidence Level (1 − α): the percentage of confidence intervals (e.g., 95% of them at a 95% level) that, estimated from an infinite number of repetitions of a given study, contain the true value of the estimated parameter.

15 / 28
Confidence Intervals
Confidence intervals (CI) are used to illustrate how reliable the estimator
obtained in the study is and are related to the standard error.

16 / 28
Power of the Test
it is the probability of avoiding a Type II error
the greater the test’s power, the better its ability to reject the null
hypothesis (if it is indeed false!)
the sample should be chosen to achieve a power of at least 80%
power = 1 − β
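
Power can be estimated by simulation: generate many studies in which the effect really exists and count how often H0 is correctly rejected. A rough sketch, assuming NumPy and SciPy (the effect size and spread are made up):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=7)

def estimated_power(n, true_diff=2.0, sigma=5.0, alpha=0.05, trials=2_000):
    """Fraction of simulated studies that correctly reject a false H0."""
    rejections = 0
    for _ in range(trials):
        group1 = rng.normal(0.0, sigma, size=n)
        group2 = rng.normal(true_diff, sigma, size=n)
        _, p = ttest_ind(group1, group2)
        if p < alpha:
            rejections += 1
    return rejections / trials

for n in (20, 50, 100):
    print(f"n = {n:>3} per group: power ≈ {estimated_power(n):.2f}")
```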

17 / 28
Test Statistic
calculated from sample data; used to decide if H0 should be rejected
when running a statistical test, we calculate the probability of
obtaining a specific value of the test statistic based on our sample
the smaller the probability (p-value) of obtaining your result
under H0 , the stronger the indication that your result is
significant and that the null hypothesis might be false

18 / 28
p-Value and Critical Value

If the probability of getting a test statistic at least as extreme as the one


calculated from the sample is less than 5% (p < 0.05), we reject the null
hypothesis.
p-value: probability of getting data as extreme as observed,
assuming H0 is true,
critical value: the point beyond which we consider the result
meaningful, not due to chance.
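
A small sketch of both quantities for a two-tailed z-test, assuming SciPy (the observed statistic is a made-up value):

```python
from scipy.stats import norm

alpha = 0.05

# Two-tailed critical value for the standard normal distribution (~1.96).
critical_value = norm.ppf(1 - alpha / 2)

# p-value of an observed test statistic, assuming H0 is true.
z_observed = 2.17                               # hypothetical value from a sample
p_value = 2 * (1 - norm.cdf(abs(z_observed)))   # ~0.03

print(f"critical value = {critical_value:.2f}, p-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "No grounds to reject H0")
```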

19 / 28
p-Value and Critical Value
Example: If p = 0.03 and α = 0.05, we reject H0 .

20 / 28
Common mistakes

Common mistakes in reporting and interpreting statistical results:

confusing p-value with significance level α,


wrong interpretation of the p-value,
ignoring sample size effects,
misinterpreting non-significant results,
low-powered tests leading to overestimated effects,
publication bias / winner’s curse.

21 / 28
Statistical Hypothesis Testing Process

22 / 28
Mean (Average)

Definition: The mean, or average, is the sum of all data values divided
by the number of values.

Formula:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

Example:
For the data: 5, 7, 8, 10, 10
Mean = (5 + 7 + 8 + 10 + 10) / 5 = 8
Use: Useful for understanding the central tendency of a dataset.
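
The example can be checked with Python's standard library:

```python
import statistics

data = [5, 7, 8, 10, 10]
print(statistics.mean(data))  # 8
```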

23 / 28
Median

Definition: The median is the middle value of an ordered dataset. If


there is an even number of values, it is the average of the two middle
ones.

Example:
Data: 3, 5, 7, 8, 10
Median = 7
Data: 3, 5, 7, 8
Median = (5 + 7) / 2 = 6
Use: Useful when data contains outliers.
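
Again with the standard library:

```python
import statistics

print(statistics.median([3, 5, 7, 8, 10]))  # 7
print(statistics.median([3, 5, 7, 8]))      # 6.0 (average of the two middle values)
```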

24 / 28
Quantiles

Definition: Quantiles divide a dataset into equal-sized intervals.


Common quantiles include quartiles (4 parts), deciles (10 parts), and
percentiles (100 parts).

Example:
The first quartile (Q1) is the 25th percentile, the median (Q2) is the
50th percentile, and the third quartile (Q3) is the 75th percentile.
Use: Helps describe the distribution and spread of data.
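
A quick sketch with NumPy (the data are illustrative; note that different interpolation methods can give slightly different quartiles):

```python
import numpy as np

data = [3, 5, 7, 8, 10, 12, 15, 18]
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)  # first quartile (Q1), median (Q2), third quartile (Q3)
```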

25 / 28
Mode (Dominant Value)

Definition: The mode is the value that appears most frequently in a


dataset. A dataset can have more than one mode.

Example:
Data: 3, 4, 4, 5, 6, 6, 6, 7
Mode = 6
Use: Useful for categorical data or to identify common values.
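
With the standard library (multimode returns every mode when there is a tie):

```python
import statistics

data = [3, 4, 4, 5, 6, 6, 6, 7]
print(statistics.mode(data))       # 6
print(statistics.multimode(data))  # [6]
```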

26 / 28
Variance

Definition: Variance measures how far data values are spread out from
the mean.

Formula:
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

Example:
If data = 2, 4, 4, 4, 5, 5, 7, 9 and mean = 5,
then the variance is the average of the squared differences from 5:
(9 + 1 + 1 + 1 + 0 + 0 + 4 + 16) / 8 = 4
Use: Important in probability and statistical modeling.
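
Checking the example with the standard library (pvariance divides by n, matching the formula above):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.pvariance(data))  # 4
```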

27 / 28
Standard Deviation

Definition: Standard deviation is the square root of the variance. It


indicates how much the values typically differ from the mean.

Formula:
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}

Use: Easier to interpret than variance because it has the same unit as
the data. Useful for comparing variability.
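
And the corresponding population standard deviation for the same example data:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.pstdev(data))  # 2.0, i.e. the square root of the variance (4)
```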

28 / 28
