
Statistics Course Notes

Uploaded by

sakariyads1


Iliya Valchanov

Statistics

Table of Contents
Abstract
1. Descriptive Statistics
1.1 Types of Data
1.2 Levels of Measurement
1.3 Graphs and Tables that Represent Categorical Variables
1.3.1 Excel Formulas
1.3.2 Pareto Diagrams in Excel
1.4 Graphs and Tables that Represent Numerical Variables
1.4.1 Frequency Distribution Table and Histogram
1.5 Graphs and Tables for Relationships Between Variables
1.5.1 Cross Tables
1.5.2 Scatter Plots
1.6 Mean, Median and Mode
1.7 Skewness
1.8 Variance and Standard Deviation
1.9 Covariance and Correlation
2. Inferential Statistics
2.1 Distributions
2.1.1 The Normal Distribution
2.1.2 The Standard Normal Distribution
2.2 The Central Limit Theorem
2.3 Estimators and Estimates
2.4 Confidence Intervals and the Margin of Error
2.5 Student’s T Distribution
2.6 Formulas for Confidence Intervals
3. Hypothesis Testing
3.1 The Scientific Method
3.2 Hypotheses
3.3 Null Hypotheses
3.4 Decisions You Can Take
3.5 Level of Significance and Types of Tests
3.6 Statistical Errors (Type I Error and Type II Error)
3.7 P-value
3.8 Formulae for Hypothesis Testing

Abstract

Statistics is an essential component of the ever-expanding field of data science, playing an invaluable role in making informed business decisions. Statistical functions are applied to large sets of data to draw conclusions, make predictions, and minimize loss.
Therefore, if you want to enjoy a successful data science career, you need a solid grasp of the core statistical concepts, all covered in these course notes. We start off with descriptive statistics, diving into the associated graphs and tables for numerical and categorical data. Then we take a look at inferential statistics, the different types of distributions, confidence intervals and their respective formulas.
We finish off with the process of hypothesis testing, going into the types, examples, formulas and the p-value.

Keywords: statistics, numerical data, categorical data, p-value, normal distribution, confidence interval, hypothesis testing

1. Descriptive Statistics

1.1 Types of Data

Types of data:
• Categorical
• Numerical (Discrete or Continuous)

Categorical data represents groups or categories.

Examples:
1. Car brands: Audi, BMW and Mercedes.
2. Answers to yes/no questions: yes and no.

Numerical data represents numbers. It is divided into two groups: discrete and continuous. Discrete data can usually be counted in a finite manner, while continuous data is infinite and impossible to count.

Examples:
Discrete: # children you want to have, SAT score
Continuous: weight, height

1.2 Levels of Measurement

Levels of measurement:
• Qualitative: Nominal, Ordinal
• Quantitative: Interval, Ratio

There are two qualitative levels: nominal and ordinal. The nominal level represents categories that cannot be put in any order, while ordinal represents categories that can be ordered.

Examples:
Nominal: four seasons (winter, spring, summer, autumn)
Ordinal: rating your meal (disgusting, unappetizing, neutral, tasty, and delicious)

There are two quantitative levels: interval and ratio. They both represent “numbers”; however, ratios have a true zero, while intervals don’t.

Examples:
Interval: degrees Celsius and Fahrenheit
Ratio: temperature in Kelvin, length

1.3 Graphs and Tables that Represent Categorical Variables

Four common representations: frequency distribution tables, bar charts, pie charts and Pareto diagrams.

Frequency distribution table (Sales):

Brand       Frequency
Audi        124
BMW         98
Mercedes    113
Total       335

Frequency distribution tables show the category and its corresponding absolute frequency.

[Figure: bar chart "Sales"; x-axis: Audi, BMW, Mercedes; y-axis: frequency (124, 98, 113)]

Bar charts are very common. Each bar represents a category. On the y-axis we have the absolute frequency.

[Figure: pie chart "Sales"; Audi 37%, Mercedes 34%, BMW 29%]

Pie charts are used when we want to see the share of an item as a part of the total. Market share is almost always represented with a pie chart.

[Figure: Pareto diagram "Sales"; bars of 124, 113 and 98 in descending order; left axis: frequency; right axis: cumulative frequency from 0% to 100%]

The Pareto diagram is a special type of bar chart where the categories are shown in descending order of frequency, and a separate curve shows the cumulative frequency.
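The ordering and the cumulative curve behind a Pareto diagram can also be sketched outside Excel. The snippet below is an illustrative Python sketch, not part of the original Excel workflow; the function name `pareto_table` is a made-up helper, and the frequencies are the Sales figures from this section.

```python
# Sales frequencies from the frequency distribution table in this section.
sales = {"Audi": 124, "BMW": 98, "Mercedes": 113}


def pareto_table(frequencies):
    """Return (category, frequency, cumulative share) rows, largest first."""
    total = sum(frequencies.values())
    # Sort categories by frequency in descending order, as a Pareto diagram does.
    ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for category, count in ordered:
        running += count
        rows.append((category, count, running / total))
    return rows


rows = pareto_table(sales)
for category, count, cumulative in rows:
    print(f"{category:<10}{count:>5}{cumulative:>8.0%}")
```

The bars of the diagram follow the second column, while the cumulative curve follows the third column and always ends at 100%.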

1.3.1 Excel formulas

Frequency distribution tables: In Excel, we can either hard code the frequencies or count them with a count function. This will come up later on.
Total formula: =SUM()

Bar charts: Bar charts are also called clustered column charts in Excel. Choose your data, Insert -> Charts -> Clustered column or Bar chart.

Pie charts: Pie charts are created in the following way: Choose your data, Insert -> Charts -> Pie chart

Pareto diagrams: See the next section.

1.3.2 Pareto Diagrams in Excel

[Figure: Pareto diagram "Sales"; bars of 124, 113 and 98; left axis: frequency from 0 to 150; right axis: cumulative frequency from 0% to 100%]

Creating Pareto diagrams in Excel:

1. Order the data in your frequency distribution table in descending order.
2. Create a bar chart.
3. Add a column in your frequency distribution table that measures the cumulative frequency.
4. Select the plot area of the chart in Excel and Right click.
5. Choose Select series.
6. Click Add.
7. Series name doesn’t matter. You can put ‘Line’.
8. For Series values choose the cells that refer to the cumulative frequency.
9. Click OK. You should see two side-by-side bars.
10. Select the plot area of the chart and Right click.
11. Choose Change Chart Type.
12. Select Combo.
13. Choose the type of representation from the dropdown list. Your initial categories should be ‘Clustered Column’. Change the second series, that you called ‘Line’, to ‘Line’.
14. Done.

1.4 Graphs and tables that represent numerical variables

1.4.1 Frequency distribution table and histogram

Interval start Interval end Frequency Relative frequency

1 21 2 0.10
21 41 4 0.20
41 61 3 0.15
61 81 6 0.30
81 101 5 0.25

Frequency distribution tables for numerical variables are different from the ones for categorical variables. Usually, the values are grouped into intervals of equal (or unequal) length. The tables show the interval and the absolute frequency, and it is sometimes useful to also include the relative (and cumulative) frequencies.

The interval width is calculated using the following formula:

$$\text{Interval width} = \frac{\text{Largest number} - \text{Smallest number}}{\text{Number of desired intervals}}$$

Creating the frequency distribution table in Excel:

1. Decide on the number of intervals you would like to use.
2. Find the interval width (using the formula above).
3. Start your 1st interval at the lowest value in your dataset.
4. Finish your 1st interval at the lowest value + the interval width ( = start_interval_cell + interval_width_cell ).
5. Start your 2nd interval where the 1st stops (that’s a formula as well: just make the starting cell of interval 2 = the ending cell of interval 1).
6. Continue in this way until you have created the desired number of intervals.
7. Count the absolute frequencies using the following COUNTIF formula: =COUNTIF(dataset_range,">="&interval_start)-COUNTIF(dataset_range,">"&interval_end)
8. In order to calculate the relative frequencies, use the following formula: = absolute_frequency_cell / number_of_observations
9. In order to calculate the cumulative frequencies:
   I. The first cumulative frequency is equal to the first relative frequency.
   II. Each consecutive cumulative frequency = previous cumulative frequency + the respective relative frequency.

Note that all formulas can be found in the lesson Excel files and the solutions of the exercises provided with each lesson.
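As a cross-check on the Excel steps, the relative and cumulative frequencies of the table in this section can be recomputed in a few lines of Python. This is only a sketch: the helper names `interval_width` and `relative_and_cumulative` are illustrative, and the absolute frequencies and interval bounds are copied from the table (smallest value 1, largest value 101, five intervals).

```python
# Absolute frequencies of the five intervals in the table of this section.
absolute = [2, 4, 3, 6, 5]


def interval_width(largest, smallest, n_intervals):
    """Interval width = (largest - smallest) / number of desired intervals."""
    return (largest - smallest) / n_intervals


def relative_and_cumulative(counts):
    """Return the relative and cumulative relative frequencies for the counts."""
    n = sum(counts)
    relative = [count / n for count in counts]
    cumulative = []
    running = 0
    for count in counts:
        running += count  # accumulate counts first to avoid float drift
        cumulative.append(running / n)
    return relative, cumulative


width = interval_width(101, 1, 5)
relative, cumulative = relative_and_cumulative(absolute)
print(width)
print(relative)
print(cumulative)
```

The last cumulative frequency is always 1, which is a quick sanity check for a hand-built table.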

[Figure: histogram of absolute frequencies; x-axis intervals: [1,21], (21,41], (41,61], (61,81], (81,101]; y-axis: frequency from 0 to 7]

Creating a histogram in Excel:

1. Choose your data
2. Insert -> Charts -> Histogram
3. To change the number of bins (intervals):
   1. Select the x-axis
   2. Click Chart Tools -> Format -> Axis options
   3. You can select the bin width (interval width), number of bins, etc.

[Figure: histogram of relative frequencies; same intervals on the x-axis; y-axis: relative frequency from 0.10 to 0.30]

Histograms are one of the most common ways to represent numerical data. Each bar has a width equal to the width of the interval. The bars touch because the intervals are continuous: where one ends, the next begins.

1.5 Graphs and Tables for Relationships Between Variables

1.5.1 Cross Tables

Type of investment / Investor   Investor A   Investor B   Investor C   Total
Stocks                                  96          185           39     320
Bonds                                  181            3           29     213
Real Estate                             88          152          142     382
Total                                  365          340          210     915

Cross tables (or contingency tables) are used to represent categorical variables. One set of categories labels the rows and another labels the columns. We then fill in the table with the applicable data. It is a good idea to calculate the totals. Sometimes, these tables are constructed with the relative frequencies, as shown in the table below.

Type of investment / Investor   Investor A   Investor B   Investor C   Total
Stocks                                0.10         0.20         0.04    0.35
Bonds                                 0.20         0.00         0.03    0.23
Real Estate                           0.10         0.17         0.16    0.42
Total                                 0.40         0.37         0.23    1.00
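The totals and relative frequencies of a cross table can be recomputed programmatically. The Python sketch below uses the investment data from this section; the Bonds / Investor B cell is taken as 3, the value implied by the Bonds row total (213) and the Investor B column total (340). The variable names are illustrative.

```python
# Absolute frequencies: rows are investment types, columns are investors.
cross = {
    "Stocks": {"Investor A": 96, "Investor B": 185, "Investor C": 39},
    "Bonds": {"Investor A": 181, "Investor B": 3, "Investor C": 29},
    "Real Estate": {"Investor A": 88, "Investor B": 152, "Investor C": 142},
}

investors = ["Investor A", "Investor B", "Investor C"]

# Row totals, column totals and the grand total.
row_totals = {kind: sum(cells.values()) for kind, cells in cross.items()}
column_totals = {inv: sum(cells[inv] for cells in cross.values()) for inv in investors}
grand_total = sum(row_totals.values())

# Relative frequencies: every cell divided by the grand total.
relative = {
    kind: {inv: cells[inv] / grand_total for inv in investors}
    for kind, cells in cross.items()
}

for kind, cells in relative.items():
    print(kind, [round(value, 2) for value in cells.values()])
```

Dividing by the grand total gives shares of all observations; dividing by a row or column total instead would give conditional shares, which answer a different question.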

A common way to represent the data from a cross table is by using a side-by-side bar chart.

Creating a side-by-side chart in Excel:

1. Choose your data
2. Insert -> Charts -> Clustered Column

Selecting more than one series (groups of data) will automatically prompt Excel to create a side-by-side bar (column) chart.

[Figure: side-by-side bar chart; x-axis: Investor A, Investor B, Investor C; one bar each for Stocks, Bonds and Real Estate; y-axis: frequency from 0 to 200]

1.5.2 Scatter Plots

[Figure: scatter plot; both axes range from 0 to 800]

When we want to represent two numerical variables on the same graph, we usually use a scatter plot. Scatter plots are especially useful later on, when we talk about regression analysis, as they help us detect patterns (linearity, homoscedasticity).
Scatter plots usually represent lots and lots of data. Typically, we are not interested in single observations, but rather in the structure of the dataset.

Creating a scatter plot in Excel:

1. Choose the two datasets you want to plot.
2. Insert -> Charts -> Scatter

A scatter plot that looks like the one below represents data that doesn’t have a pattern. Completely vertical ‘forms’ show no association.

[Figure: scatter plot; x-axis from 0 to 180; y-axis from 21.5 to 25.5]

Conversely, the plot at the top of this section shows a linear pattern, meaning that the observations move together.

1.6 Mean, Median, Mode

Mean

The mean is the most widely used measure of central tendency. It is the simple average of the dataset.
Note: the mean is easily affected by outliers.
The formula to calculate the mean is:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

In Excel, the mean is calculated by:
=AVERAGE()

Median

The median is the midpoint of the ordered dataset. It is not as popular as the mean, but is often used in academia and data science. That is because it is not affected by outliers.
In an ordered dataset, the median is the number at position

$$\frac{n + 1}{2}$$

If this position is not a whole number, the median is the simple average of the two numbers at positions closest to the calculated value.
In Excel, the median is calculated by:
=MEDIAN()

Mode

The mode is the value that occurs most often. A dataset can have 0 modes, 1 mode or multiple modes.
The mode is calculated simply by finding the value with the highest frequency.
In Excel, the mode is calculated by:
=MODE.SNGL() -> returns one mode
=MODE.MULT() -> returns an array with the modes. It is used when we have more than 1 mode.
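The three measures can be compared on a small dataset. The Python sketch below uses a hypothetical dataset with one outlier (the value 100) and assumes the usual median position of (n + 1) / 2 in the ordered data; `median_by_position` is an illustrative helper, while `mean`, `median` and `multimode` come from Python's standard `statistics` module.

```python
import math
from statistics import mean, median, multimode

# Hypothetical dataset; 100 is a deliberate outlier.
data = [1, 2, 2, 3, 4, 4, 100]


def median_by_position(values):
    """Median via the 1-based position (n + 1) / 2 in the ordered data."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2
    lower = ordered[math.floor(position) - 1]
    upper = ordered[math.ceil(position) - 1]
    # For a whole-number position, lower and upper are the same element.
    return (lower + upper) / 2


print(mean(data))                # pulled upwards by the outlier
print(mean(data[:-1]))           # mean without the outlier
print(median_by_position(data))  # unaffected by the outlier
print(multimode(data))           # every value that occurs most often
```

Here the outlier moves the mean far above most observations while the median stays in the middle of the data, and the dataset has two modes.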

1.7 Skewness

[Figure: right-skewed distribution with the mode, median and mean marked]

Calculating skewness in Excel:

=SKEW()

Skewness is a measure of asymmetry that indicates whether the observations in a dataset are concentrated on one side.
Right (positive) skewness looks like the one in the graph. It means that the outliers are to the right (long tail to the right).
Left (negative) skewness means that the outliers are to the left (long tail to the left).
Usually, you will use software to calculate skewness.

Formula to calculate skewness:

1.8 Variance and Standard Deviation

[Figure: six data points (Point 1 to Point 6) scattered around the mean]

Variance and standard deviation measure the dispersion of a set of data points around its mean value.

There are different formulas for population and sample variance and standard deviation. This is because the sample variance formula, which divides by n - 1 rather than n, is an unbiased estimator of the population variance.

Sample variance formula:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Population variance formula:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample standard deviation formula:

$$s = \sqrt{s^2}$$

Population standard deviation formula:

$$\sigma = \sqrt{\sigma^2}$$

Calculating variance in Excel:
Sample variance: =VAR.S()
Population variance: =VAR.P()
Sample standard deviation: =STDEV.S()
Population standard deviation: =STDEV.P()
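The difference between the sample and population versions comes down to the denominator. The Python sketch below assumes the standard definitions (sum of squared deviations divided by n - 1 for a sample and by N for a population) and checks them against the standard `statistics` module; the dataset and function names are hypothetical.

```python
import math
from statistics import pvariance, variance

# Hypothetical dataset.
data = [1, 2, 3, 4, 5]


def sum_of_squared_deviations(values):
    """Sum of squared distances between each value and the mean."""
    center = sum(values) / len(values)
    return sum((x - center) ** 2 for x in values)


def sample_variance(values):
    """Divide by n - 1: the data is a sample from a larger population."""
    return sum_of_squared_deviations(values) / (len(values) - 1)


def population_variance(values):
    """Divide by N: the data is the whole population."""
    return sum_of_squared_deviations(values) / len(values)


print(sample_variance(data), math.sqrt(sample_variance(data)))
print(population_variance(data), math.sqrt(population_variance(data)))
```

`sample_variance` plays the role of Excel's =VAR.S() and `population_variance` the role of =VAR.P(); the standard deviations are their square roots.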

1.9 Covariance and Correlation

Covariance

Covariance is a measure of the joint variability of two variables.
• A positive covariance means that the two variables move together.
• A covariance of 0 means that the two variables have no linear relationship. Independent variables always have a covariance of 0, but a covariance of 0 does not by itself guarantee independence.
• A negative covariance means that the two variables move in opposite directions.

Covariance can take on values from -∞ to +∞. This is a problem as it is very hard to put such numbers into perspective.

Sample covariance formula:

$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Population covariance formula:

$$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$

In Excel, the covariance is calculated by:
Sample covariance: =COVARIANCE.S()
Population covariance: =COVARIANCE.P()

Correlation

Correlation is a measure of the joint variability of two variables. Unlike covariance, correlation can be thought of as a standardized measure. It takes on values between -1 and 1, thus it is easy for us to interpret the result.
• A correlation of 1, known as perfect positive correlation, means that one variable is perfectly explained by the other.
• A correlation of 0 means that there is no linear relationship between the variables.
• A correlation of -1, known as perfect negative correlation, means that one variable is explaining the other one perfectly, but they move in opposite directions.

Sample correlation formula:

$$r = \frac{s_{xy}}{s_x s_y}$$

Population correlation formula:

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

In Excel, correlation is calculated by:
=CORREL()
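A short Python sketch can make the relationship between the two measures concrete. It assumes the standard sample formulas (covariance divided by n - 1, correlation equal to the covariance divided by the product of the sample standard deviations); all datasets and function names are hypothetical.

```python
import math


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Covariance rescaled by both sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # covariance of x with itself = variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]            # tends to rise with x
y_opposite = [10, 8, 6, 4, 2]  # falls perfectly linearly as x rises

print(sample_covariance(x, y), sample_correlation(x, y))
print(sample_correlation(x, y_opposite))

# A dependent pair (v is the square of u) that still has zero covariance.
u = [-2, -1, 0, 1, 2]
v = [value ** 2 for value in u]
print(sample_covariance(u, v))
```

The last pair shows why a covariance of 0 only rules out a linear relationship: v is completely determined by u, yet their covariance is 0.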

2. Inferential Statistics

2.1 Distributions

Definition

In statistics, when we talk about distributions we usually mean probability distributions.

Definition (informal): A distribution is a function that shows the possible values for a variable and how often they occur.

Definition (Wikipedia): In probability theory and statistics, a probability distribution is a mathematical function that, stated in simple terms, can be thought of as providing the probabilities of occurrence of different possible outcomes in an experiment.

Examples: Normal distribution, Student’s T distribution, Poisson distribution, Uniform distribution, Binomial distribution

Graphical representation

It is a common mistake to believe that the distribution is the graph. In fact the distribution is the ‘rule’ that determines how values are positioned in relation to each other.

Very often, we use a graph to visualize the data. Since different distributions have a particular graphical representation, statisticians like to plot them.

Examples:
[Figures: Uniform distribution, Binomial distribution, Normal distribution, Student’s T distribution]

2.1.1 The Normal Distribution

The Normal distribution is also known as the Gaussian distribution or the Bell curve. It is one of the most common distributions for the following reasons:
• It approximates a wide variety of random variables
• Distributions of sample means with large enough sample sizes can be approximated by a normal distribution
• All computable statistics are elegant
• Heavily used in regression analysis
• Good track record

$$N \sim (\mu, \sigma^2)$$

N stands for normal;
~ stands for a distribution;
μ is the mean;
σ² is the variance.

Examples:
• Biology. Most biological measures are normally distributed, such as: height; length of arms, legs, nails; blood pressure; thickness of tree bark, etc.
• IQ tests
• Stock market information

Controlling for the standard deviation

[Figure: three normal curves with the same standard deviation (σ = 140) and means μ = 470, μ = 743 and μ = 960; x-axis from 0 (origin) to 1347]

Keeping the standard deviation constant, the graph of a normal distribution with:
• a smaller mean would look the same, but be situated to the left (in gray)
• a larger mean would look the same, but be situated to the right (in red)

Controlling for the mean

[Figure: three normal curves with the same mean (μ = 743) and standard deviations σ = 70, σ = 140 and σ = 210; x-axis from 0 (origin) to 1347]

Keeping the mean constant, a normal distribution with:
• a smaller standard deviation would be situated in the same spot, but have a higher peak and thinner tails (in red)
• a larger standard deviation would be situated in the same spot, but have a lower peak and fatter tails (in gray)

2.1.2 The Standard Normal Distribution

The Standard Normal distribution is a particular case of the Normal distribution. It has a mean of 0 and a standard deviation of 1.
Every Normal distribution can be ‘standardized’ using the standardization formula:

$$z = \frac{x - \mu}{\sigma}$$

A variable following the Standard Normal distribution is denoted with the letter z, with distribution N~(0, 1).

Why standardize?

Standardization allows us to:
• compare different normally distributed datasets
• detect normality
• detect outliers
• create confidence intervals
• test hypotheses
• perform regression analysis

Rationale of the formula for standardization:
We want to transform a random variable from N~(μ, σ²) to N~(0, 1).
Subtracting the mean from all observations would cause a transformation from N~(μ, σ²) to N~(0, σ²), moving the graph to the origin.
Subsequently, dividing all observations by the standard deviation would cause a transformation from N~(0, σ²) to N~(0, 1), standardizing the peak and the tails of the graph.
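The two steps of the rationale (subtract the mean, then divide by the standard deviation) can be checked on a small dataset. The Python sketch below uses hypothetical values chosen so that the mean is 5 and the population standard deviation is 2; `standardize` is an illustrative helper built on the standard `statistics` module.

```python
from statistics import mean, pstdev

# Hypothetical dataset with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]


def standardize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


z_scores = standardize(data)
print(z_scores)
print(mean(z_scores), pstdev(z_scores))
```

After the transformation the values have a mean of 0 and a standard deviation of 1, whatever the original mean and spread were. Standardizing does not make non-normal data normal; it only relocates and rescales it.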

2.2 The Central Limit Theorem

The Central Limit Theorem (CLT) is one of the greatest statistical insights. It states that no matter the underlying distribution of the dataset, the sampling distribution of the means would approximate a normal distribution, provided the samples are large enough. Moreover, the mean of the sampling distribution would be equal to the mean of the original distribution and the variance would be n times smaller, where n is the size of the samples. The CLT applies whenever we have a sum or an average of many variables (e.g. sum of rolled numbers when rolling dice).

The theorem

• No matter the distribution
• The distribution of the sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4, \dots, \bar{x}_k$ would tend to $N \sim (\mu, \sigma^2 / n)$
• The more samples, the closer to Normal
• The bigger the samples, the closer to Normal

Why is it useful?

The CLT allows us to assume normality for many different variables. That is very useful for confidence intervals, hypothesis testing, and regression analysis. In fact, the Normal distribution is so predominantly observed around us because, following the CLT, many variables converge to Normal.

Where can we see it?

Since many concepts and events are a sum or an average of different effects, the CLT applies and we observe normality all the time. For example, in regression analysis, the dependent variable is explained through the sum of error terms.
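The dice example lends itself to a small simulation. The Python sketch below is illustrative only: the sample size (30), the number of samples (5000) and the seed are arbitrary assumptions. It draws repeated samples of fair die rolls and compares the mean and variance of the sample means with the values the CLT predicts (3.5 and (35/12) / n).

```python
import random


def sample_means(n_samples, sample_size, seed=0):
    """Average `sample_size` fair die rolls, repeated `n_samples` times."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = []
    for _ in range(n_samples):
        rolls = [rng.randint(1, 6) for _ in range(sample_size)]
        means.append(sum(rolls) / sample_size)
    return means


sample_size = 30
means = sample_means(n_samples=5000, sample_size=sample_size)

center = sum(means) / len(means)
spread = sum((m - center) ** 2 for m in means) / len(means)

# A single fair die has mean 3.5 and variance 35/12.
print(center, 3.5)
print(spread, (35 / 12) / sample_size)
```

A single die roll is uniformly distributed, yet a histogram of `means` would already look bell-shaped, centered near 3.5 and with a variance close to the original variance divided by the sample size.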

2.3 Estimators and Estimates

Estimators

Broadly, an estimator is a mathematical function that approximates a population parameter depending only on sample information.
Examples of estimators and the corresponding parameters:

Term          Estimator   Parameter
Mean          x̄           μ
Variance      s²          σ²
Correlation   r           ρ

Estimators have two important properties:
• Bias
  The expected value of an unbiased estimator is the population parameter. The bias in this case is 0. If the expected value of an estimator is (parameter + b), then the bias is b.
• Efficiency
  The most efficient estimator is the one with the smallest variance.

Estimates

An estimate is the output that you get from the estimator (when you apply the formula). There are two types of estimates: point estimates and confidence interval estimates.

Point estimates: a single value.
Examples:
• 1
• 5
• 122.67
• 0.32

Confidence intervals: an interval.
Examples:
• (1, 5)
• (12, 33)
• (221.78, 745.66)
• (-0.71, 0.11)

Confidence intervals are much more informative than point estimates because they also express the uncertainty of the estimate. That is why they are preferred when making inferences.

2.4 Confidence Intervals and the Margin of Error

[Figure: a confidence interval drawn from the interval start to the interval end, with the point estimate in the middle]

Definition: A confidence interval is an interval within which we are confident (with a certain percentage of confidence) the population parameter will fall.
We build the confidence interval around the point estimate.
(1-α) is the level of confidence. We are (1-α)*100% confident that the population parameter will fall within the interval.

General formula:

$$[\text{point estimate} - ME, \; \text{point estimate} + ME]$$

where ME is the margin of error:

$$ME = \text{reliability factor} \times \frac{\text{standard deviation}}{\sqrt{\text{sample size}}}$$

Term      Effect on width of CI
(1-α) ↑   ↑
σ ↑       ↑
n ↑       ↓
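The effect of each term on the width of the interval can be explored numerically. The Python sketch below covers only the simplest case, a single population with known standard deviation, where the reliability factor is a z-value; the sample mean, standard deviation and sample sizes are hypothetical, and `z_confidence_interval` is an illustrative helper built on `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    alpha = 1 - confidence
    # Reliability factor: the z-value that leaves alpha / 2 in the upper tail.
    reliability_factor = NormalDist().inv_cdf(1 - alpha / 2)
    margin_of_error = reliability_factor * sigma / sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error


# Hypothetical inputs: sample mean 100, known sigma 15, sample size 36.
low, high = z_confidence_interval(100, 15, 36)
print(low, high)

# Higher confidence widens the interval; a larger sample narrows it.
print(z_confidence_interval(100, 15, 36, confidence=0.99))
print(z_confidence_interval(100, 15, 144))
```

With an unknown population variance the same structure applies, but the reliability factor would come from the Student’s T distribution instead.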

2.5 Student’s T Distribution

The Student’s T distribution is used predominantly for creating confidence intervals and testing hypotheses with normally distributed populations when the sample sizes are small. It is particularly useful when we don’t have enough information or it is too costly to obtain it.

All else equal, the Student’s T distribution has fatter tails than the Normal distribution and a lower peak. This is to reflect the higher level of uncertainty, caused by the small sample size.

[Figure: the Student’s T distribution plotted against the Normal distribution; the T curve has a lower peak and fatter tails]

A random variable following the t-distribution is denoted $t_{\upsilon,\alpha}$, where υ is the number of degrees of freedom.
We can obtain the Student’s T distribution for a variable with a Normally distributed population using the formula:

$$t_{\upsilon,\alpha} = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

where υ = n - 1 for a sample of size n.

2.6 Formulas for Confidence Intervals

# populations   Population variance           Samples       Statistic
One             known                         -             z
One             unknown                       -             t
Two             -                             dependent     t
Two             known                         independent   z
Two             unknown, assumed equal        independent   t
Two             unknown, assumed different    independent   t

[The Variance and Formula columns of this table were typeset as images and are not reproduced here.]

3. Hypothesis Testing

3.1 Scientific method

The ‘scientific method’ is a procedure that has characterized natural science since the 17th century. It consists of systematic observation, measurement and experiment, and the formulation, testing and modification of hypotheses.
Since then we’ve evolved to the point where most people, and especially professionals, realize that pure observation can be deceiving. Therefore, business decisions are increasingly driven by data. That’s also the purpose of data science.

While we don’t ‘name’ the scientific method in the videos, that’s the underlying idea. There are several steps you would follow to reach a data-driven decision.

Steps in data-driven decision making:

1. Formulate a hypothesis
2. Find the right test
3. Execute the test
4. Make a decision

3.2 Hypotheses

A hypothesis is “an idea that can be tested”.
It is a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.

Null hypothesis (H0)

The null hypothesis is the hypothesis to be tested.
It is the status-quo: everything which was believed until now that we are contesting with our test.
The concept of the null is similar to: innocent until proven guilty. We assume innocence until we have enough evidence to prove that a suspect is guilty.

Alternative hypothesis (H1 or HA)

The alternative hypothesis is the change or innovation that is contesting the status-quo.
Usually the alternative is our own opinion. The idea is the following:
If the null is the status-quo (i.e., what is generally believed), then the act of performing a test shows we have doubts about the truthfulness of the null. More often than not the researcher’s opinion is contained in the alternative hypothesis.

3.3 Null Hypotheses

After a discussion in the Q&A, we have decided to include further clarifications regarding the null and alternative hypotheses.

A hypothesis is “an idea that can be tested”.

Student’s question:
‘As per the above logic, in the video tutorial about the salary of the data scientist, the null hypothesis should have been: Data Scientists do not make an average of $113,000. In the second example the null hypothesis should have been: The average salary should be less than or equal to $125,000. Please explain further.’

Now note that the statement in the question is NOT true.

Instructor’s answer (with some adjustments):
‘I see why you would ask this question, as I asked the same one right after I was introduced to hypothesis testing. In statistics, the null hypothesis is the statement we are trying to reject. Think of it as the ‘status-quo’. The alternative, therefore, is the change or innovation.
Example 1: So, for the data scientist salary example, the null would be: the mean data scientist salary is $113,000. Then we will try to reject the null with a statistical test. So, usually, your personal opinion (e.g. data scientists don’t earn exactly that much) is the alternative hypothesis.
Example 2: Our friend Paul told us that the mean salary is >=$125,000 (status-quo, null). Our opinion is that he may be wrong, so we are testing that. Therefore, the alternative is: the mean data scientist salary is lower than $125,000.
It truly is counter-intuitive in the beginning, but later on, when you start doing the exercises, you will understand the mechanics.’

3.4 Decisions You Can Take

When testing, there are two decisions that can be made: to accept the null hypothesis or to reject the null hypothesis.
To accept the null means that there isn’t enough data to support the change or the innovation brought by the alternative. To reject the null means that there is enough statistical evidence that the status-quo is not representative of the truth.

[Figure: two-tailed test; a distribution centered at 0 with the acceptance region in the middle and a rejection region in each tail]

Given a two-tailed test:

Graphically, the tails of the distribution show when we reject the null hypothesis (‘rejection region’).
Everything which remains in the middle is the ‘acceptance region’.

The rationale is: if the observed statistic is too far away from 0 (depending on the significance level), we reject the null. Otherwise, we accept it.

Different ways of reporting the result:

Accept:
• At x% significance, we accept the null hypothesis
• At x% significance, A is not significantly different from B
• At x% significance, there is not enough statistical evidence that…
• At x% significance, we cannot reject the null hypothesis

Reject:
• At x% significance, we reject the null hypothesis
• At x% significance, A is significantly different from B
• At x% significance, there is enough statistical evidence…
• At x% significance, we cannot say that *restate the null*

3.5 Level of Significance and Types of Tests

Level of significance (α): the probability of rejecting a null hypothesis that is true; the probability of making this error.

Common significance levels: 0.10, 0.05, 0.01

Two-sided (two-tailed) test

Used when the null contains an equality (=) or an inequality sign (≠).

[Figure: two-tailed test with α = 0.05; a rejection region of α/2 = 0.025 in each tail and the acceptance region in the middle]

One-sided (one-tailed) test

Used when the null contains a directional sign (<, >, ≤, ≥) rather than an equality (=) or inequality (≠) sign.

[Figure: one-tailed test; a single rejection region of α = 0.05 in one tail]
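The split of α between the tails determines where the rejection region starts. The Python sketch below assumes a test statistic that follows the standard normal distribution (a z-test) and uses `statistics.NormalDist` to find the critical values; the function names are illustrative.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_critical_value(alpha):
    """Positive critical value when alpha is split equally between both tails."""
    return STANDARD_NORMAL.inv_cdf(1 - alpha / 2)


def one_sided_critical_value(alpha):
    """Critical value when the whole alpha sits in the upper tail."""
    return STANDARD_NORMAL.inv_cdf(1 - alpha)


for alpha in (0.10, 0.05, 0.01):
    print(alpha, two_sided_critical_value(alpha), one_sided_critical_value(alpha))
```

At α = 0.05 the two-sided rejection regions start at roughly ±1.96, while a one-sided test rejects beyond roughly 1.645 in the direction of the alternative (for a lower-tail test, the critical value is the negative of that number).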

3.6 Statistical Errors (Type I Error and Type II Error)

In general, there are two types of errors we can make while testing: Type I error (False positive) and Type II error (False negative).

Statisticians summarize the errors in the following table:

H0: Status quo

                  The truth
                  H0 is true                      H0 is false
Accept H0         Correct decision                Type II error (False negative)
Reject H0         Type I error (False positive)   Correct decision

Here’s the table with the example from the lesson:

H0: She doesn’t like you (you should not invite her)

                        The truth
                        She doesn’t like you            She likes you
Accept (do nothing)     Correct decision                Type II error (False negative)
Reject (invite her)     Type I error (False positive)   Correct decision

The probability of committing a Type I error (False positive) is equal to the significance level (α).
The probability of committing a Type II error (False negative) is denoted by beta (β).

3.7 P-value

The p-value is the smallest level of significance at which we

P-value can still reject the null hypothesis, given the observed sample
statistic

Notable p-values

0.000: When we are testing a hypothesis, we always strive for those ‘three zeros after the dot’. This indicates that we reject the null at all commonly used significance levels.

0.05: Often the ‘cut-off line’. If our p-value is higher than 0.05 we would normally accept the null hypothesis (equivalent to testing at 5% significance level). If the p-value is lower than 0.05 we would reject the null.

Where and how are p-values used?

• Most statistical software calculates p-values for each test
• The researcher can decide the significance level post-factum
• p-values are usually found with 3 digits after the dot (x.xxx)
• The closer the p-value is to 0.000, the stronger the evidence against the null hypothesis

Should you need to calculate a p-value ‘manually’, we suggest using an online p-value
calculator, e.g. this one.
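For a z-test, a p-value can also be computed in a few lines of code. The sketch below assumes the test statistic is standard normal under the null; the function name `z_test_p_value` and the `tail` labels are chosen for this example only.

```python
from statistics import NormalDist


def z_test_p_value(z, tail="two-sided"):
    """Return the p-value of an observed z statistic.

    Assumption: the statistic is standard normal when the null is true.
    """
    cdf = NormalDist().cdf
    if tail == "two-sided":
        # probability of a statistic at least this extreme in either tail
        return 2 * (1 - cdf(abs(z)))
    if tail == "upper":
        return 1 - cdf(z)
    if tail == "lower":
        return cdf(z)
    raise ValueError("tail must be 'two-sided', 'upper' or 'lower'")


for z in (1.0, 1.96, 2.5, 4.0):
    print(f"z = {z}: two-sided p-value = {z_test_p_value(z):.3f}")
```

The output is printed with 3 digits after the dot, matching the way statistical software usually reports p-values.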

3.8 Formulae for Hypothesis Testing

# populations    Population variance       Samples        Statistic
One              known                     -              z
One              unknown                   -              t
Two              known                     independent    z
Two              unknown, assumed equal    independent    t

[Columns not reproduced: Variance; Formula for test statistic.]
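For reference, the four cases above correspond to the following standard textbook test statistics. The notation is an assumption made here and may differ from the course's own: x̄ and ȳ are sample means, μ0 is the hypothesized mean (or hypothesized difference of means in the two-population cases), σ and s are population and sample standard deviations, and n is the sample size.

```latex
\begin{align*}
% One population, variance known
Z &= \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \\[4pt]
% One population, variance unknown (n - 1 degrees of freedom)
T &= \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \\[4pt]
% Two independent samples, variances known
Z &= \frac{(\bar{x} - \bar{y}) - \mu_0}
          {\sqrt{\sigma_x^2 / n_x + \sigma_y^2 / n_y}} \\[4pt]
% Two independent samples, variances unknown but assumed equal:
% pooled variance, then T with n_x + n_y - 2 degrees of freedom
s_p^2 &= \frac{(n_x - 1)\, s_x^2 + (n_y - 1)\, s_y^2}{n_x + n_y - 2} \\[4pt]
T &= \frac{(\bar{x} - \bar{y}) - \mu_0}
          {\sqrt{s_p^2 / n_x + s_p^2 / n_y}}
\end{align*}
```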

Decision rule
There are several ways to phrase the decision rule and they all have the same
meaning.
Reject the null if:

1) |test statistic| > |critical value|
2) The absolute value of the test statistic is bigger than the absolute critical value
3) p-value < some significance level, most often 0.05

Usually, you will be using the p-value to make a decision.
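To see that the critical-value rule and the p-value rule lead to the same decision, the sketch below applies both to a two-sided z-test. It assumes a standard normal test statistic; the function `decide` and its dictionary keys are hypothetical names used only for this illustration.

```python
from statistics import NormalDist


def decide(z, alpha=0.05):
    """Apply both phrasings of the decision rule to a two-sided z-test."""
    std_normal = NormalDist()
    critical_value = std_normal.inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return {
        # Rule 1: |test statistic| > |critical value|
        "reject_by_critical_value": abs(z) > abs(critical_value),
        # Rule 3: p-value < significance level
        "reject_by_p_value": p_value < alpha,
    }


for z in (0.8, 1.5, 2.2, 3.1):
    result = decide(z)
    print(f"z = {z}: {result}")
```

Both entries agree for every test statistic, which is why working with the p-value alone is usually enough.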
