Statistics Course Notes
Iliya Valchanov
Table of Contents
Abstract
1. Descriptive Statistics
  1.1 Types of Data
  1.2 Levels of Measurement
  1.3 Graphs and Tables that Represent Categorical Variables
    1.3.1 Excel Formulas
    1.3.2 Pareto Diagrams in Excel
  1.4 Graphs and Tables that Represent Numerical Variables
    1.4.1 Frequency Distribution Table and Histogram
Abstract
1. Descriptive Statistics
Types of data
Categorical
Numerical
  Discrete. Examples: number of children you want to have, SAT score
  Continuous. Examples: weight, height
Levels of measurement
Qualitative
There are two qualitative levels: nominal and ordinal. The nominal level represents categories that cannot be put in any order, while ordinal represents categories that can be ordered.
Examples:
Nominal: four seasons (winter, spring, summer, autumn)
Ordinal: rating your meal (disgusting, unappetizing, neutral, tasty, and delicious)

Quantitative
There are two quantitative levels: interval and ratio. They both represent “numbers”; however, ratios have a true zero, while intervals don’t.
Examples:
Interval: degrees Celsius and Fahrenheit
Ratio: degrees Kelvin, length
Frequency distribution tables
Frequency distribution tables show the category and its corresponding absolute frequency.

            Frequency
Audi        124
BMW         98
Mercedes    113
Total       335

Bar charts
Bar charts are very common. Each bar represents a category. On the y-axis we have the absolute frequency.
[Figure: bar chart of Sales; Audi 124, BMW 98, Mercedes 113]

Pie charts
Pie charts are used when we want to see the share of an item as a part of the total. Market share is almost always represented with a pie chart.
[Figure: pie chart of Sales; Audi 37%, BMW 29%, Mercedes 34%]

Pareto diagrams
The Pareto diagram is a special type of bar chart where the categories are shown in descending order of frequency, and a separate curve shows the cumulative frequency.
[Figure: Pareto diagram of Sales; frequencies 124, 113, and 98 with a cumulative frequency curve rising to 100%]
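The ordering and the cumulative curve of a Pareto diagram can be sketched in a few lines of Python. This is only an illustration that reuses the sales frequencies from the table (Audi 124, BMW 98, Mercedes 113); the variable names are assumptions and are not part of the course files.

```python
# Sales frequencies taken from the frequency distribution table.
sales = {"Audi": 124, "BMW": 98, "Mercedes": 113}

total = sum(sales.values())

# Pareto order: categories sorted by descending frequency.
pareto_order = sorted(sales.items(), key=lambda item: item[1], reverse=True)

# Build (category, frequency, relative share, cumulative share) rows.
pareto_rows = []
running = 0
for category, frequency in pareto_order:
    running += frequency
    pareto_rows.append((category, frequency, frequency / total, running / total))

for category, frequency, share, cumulative in pareto_rows:
    print(f"{category:<10}{frequency:>5}{share:>8.0%}{cumulative:>8.0%}")
```

The bars of the diagram correspond to the frequency column, and the separate curve corresponds to the cumulative share column.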
Frequency distribution tables: in Excel, we can either hard-code the frequencies or count them with a count function. This will come up later on.
Total formula: =SUM()
Bar charts: bar charts are also called clustered column charts in Excel. Choose your data, then Insert -> Charts -> Clustered column or Bar chart.
[Figures: pie chart of Sales (Audi 37%, BMW 29%, Mercedes 34%) and Pareto diagram of Sales with a cumulative frequency axis from 0% to 100%]
14. Done.
Frequency distribution tables for numerical variables are different from the ones for
categorical variables. Usually, the data is divided into intervals of equal (or unequal)
length. The tables show the interval and the absolute frequency; sometimes it is also
useful to include the relative (and cumulative) frequencies.
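As a rough sketch of how such a table could be built outside Excel, the Python snippet below counts observations per interval and derives the relative and cumulative frequencies. The data values and the helper `interval_index` are hypothetical and chosen only for illustration; the interval edges assume a first interval closed on both sides and the remaining ones open on the left.

```python
# Hypothetical observations; not taken from the course files.
data = [8, 15, 21, 30, 33, 41, 48, 52, 57, 64, 70, 76, 85, 93, 99, 100]

# Interval edges: the first interval is [1, 21], the rest are (low, high].
edges = [1, 21, 41, 61, 81, 101]


def interval_index(value, edges):
    """Return the index of the interval that contains value."""
    if value == edges[0]:
        return 0
    for i in range(len(edges) - 1):
        if edges[i] < value <= edges[i + 1]:
            return i
    raise ValueError(f"{value} is outside the intervals")


# Absolute frequency: number of observations in each interval.
absolute = [0] * (len(edges) - 1)
for value in data:
    absolute[interval_index(value, edges)] += 1

# Relative frequency: absolute frequency divided by the number of observations.
n = len(data)
relative = [count / n for count in absolute]

# Cumulative frequency: running sum of the relative frequencies.
cumulative = []
running = 0.0
for share in relative:
    running += share
    cumulative.append(running)

for i, count in enumerate(absolute):
    print(f"({edges[i]}, {edges[i + 1]}]  {count}  {relative[i]:.2f}  {cumulative[i]:.2f}")
```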
Note that all formulas can be found in the lesson Excel files and in the solutions of the
exercises provided with each lesson.
[Figure: histogram over the intervals [1,21], (21,41], (41,61], (61,81], (81,101], shown with absolute frequencies (0 to 7) and with relative frequencies (0.10 to 0.30)]
Histograms are one of the most common ways to represent numerical data.
Each bar has a width equal to the width of the interval. The bars touch because the
intervals are contiguous: where one ends, the next begins.
Cross tables (or contingency tables) are used to represent categorical variables.
One set of categories labels the rows and another labels the columns. We
then fill in the table with the applicable data. It is a good idea to calculate the totals.
Sometimes, these tables are constructed with the relative frequencies as shown in the
table below.
A common way to represent the data from a cross table is by using a side-by-side bar
chart.
Selecting more than one series (groups of data) will automatically prompt Excel to
create a side-by-side bar (column) chart.
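A cross table with totals and relative frequencies can also be sketched in Python. The observations below (investors and asset types) are hypothetical and serve only to show the mechanics; none of the names come from the course files.

```python
from collections import Counter

# Hypothetical observations: (row category, column category) pairs.
observations = [
    ("Investor A", "Stocks"), ("Investor A", "Bonds"),
    ("Investor B", "Stocks"), ("Investor B", "Stocks"),
    ("Investor B", "Real estate"), ("Investor A", "Stocks"),
]

counts = Counter(observations)
rows = sorted({row for row, _ in observations})
cols = sorted({col for _, col in observations})

# Absolute-frequency cross table: one cell per (row, column) combination.
table = {row: {col: counts[(row, col)] for col in cols} for row in rows}

# Totals per row, per column, and overall.
row_totals = {row: sum(table[row].values()) for row in rows}
col_totals = {col: sum(table[row][col] for row in rows) for col in cols}
grand_total = sum(row_totals.values())

# Relative frequencies: each cell divided by the grand total.
relative = {
    row: {col: table[row][col] / grand_total for col in cols} for row in rows
}

for row in rows:
    print(row, table[row], "total:", row_totals[row])
print("column totals:", col_totals, "grand total:", grand_total)
```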
[Figure: scatter plot; both axes run from 0 to 800]
When we want to represent two numerical variables on the same graph, we usually use a scatter
plot. Scatter plots are useful especially later on, when we talk about regression analysis, as they
help us detect patterns (linearity, homoscedasticity).
Scatter plots usually represent lots and lots of data. Typically, we are not interested in single
observations, but rather in the structure of the dataset.
[Figure: scatter plot; x-axis from 0 to 180, y-axis from 21.5 to 26]
A scatter plot like the one below represents data that has no pattern.
Completely vertical ‘forms’ show no association.
Conversely, the plot above shows a linear pattern, meaning that the observations move
together.
Mean
The mean is the most widely used measure of central tendency. It is the simple average of the dataset.
Note: the mean is easily affected by outliers.
The formula to calculate the mean is: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
In Excel, the mean is calculated by:
=AVERAGE()

Median
The median is the midpoint of the ordered dataset. It is not as popular as the mean, but is often used in academia and data science, because it is not affected by outliers.
In an ordered dataset, the median is the number at position $\frac{n+1}{2}$. If this position is not a whole number, the median is the simple average of the two numbers at the positions closest to the calculated value.
In Excel, the median is calculated by:
=MEDIAN()

Mode
The mode is the value that occurs most often. A dataset can have 0 modes, 1 mode or multiple modes.
The mode is calculated simply by finding the value with the highest frequency.
In Excel, the mode is calculated by:
=MODE.SNGL() -> returns one mode
=MODE.MULT() -> returns an array with the modes. It is used when we have more than 1 mode.
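For readers who want to check these three measures outside Excel, here is a minimal Python sketch using the standard `statistics` module. The dataset is hypothetical and includes one outlier to show how it pulls the mean but not the median; `median_by_hand` is an illustrative name for the position-based rule described above.

```python
import statistics

# Hypothetical dataset; 100 is an outlier.
data = [1, 2, 2, 3, 4, 4, 100]

mean = statistics.mean(data)        # simple average, comparable to =AVERAGE()
median = statistics.median(data)    # midpoint of the ordered data, comparable to =MEDIAN()
modes = statistics.multimode(data)  # all most frequent values, comparable to =MODE.MULT()

# Median "by hand": position (n + 1) / 2 in the ordered dataset (1-based).
ordered = sorted(data)
n = len(ordered)
position = (n + 1) / 2
if position.is_integer():
    median_by_hand = ordered[int(position) - 1]
else:
    lower = ordered[int(position) - 1]  # value at the position rounded down
    upper = ordered[int(position)]      # value at the position rounded up
    median_by_hand = (lower + upper) / 2

print(mean, median, modes, median_by_hand)
```

With this data the outlier drags the mean well above the median, while the two modes illustrate a dataset with more than one mode.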
1.7 Skewness
[Figure: skewed distribution with the mode, median, and mean marked]
In Excel, skewness is calculated by:
=SKEW()
Sample variance:
=VAR.S()
Population variance:
=VAR.P()
Sample standard deviation:
=STDEV.S()
Population standard deviation:
=STDEV.P()
There are different formulas for population and sample variance and standard
deviation. The sample variance formula divides by n − 1 instead of n, which makes it
an unbiased estimator of the population variance. More on the mathematics behind it.
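The difference between the sample and population formulas can be illustrated with Python's `statistics` module, whose functions play roughly the same roles as the Excel functions listed above. The dataset is hypothetical, and `sum_sq_dev` is an illustrative name.

```python
import statistics

# Hypothetical dataset with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

# Population formulas divide by n (comparable to =VAR.P() and =STDEV.P()).
population_variance = statistics.pvariance(data)
population_std = statistics.pstdev(data)

# Sample formulas divide by n - 1 (comparable to =VAR.S() and =STDEV.S()).
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

print(population_variance, sum_sq_dev / n)
print(sample_variance, sum_sq_dev / (n - 1))
```

Because the divisor is smaller, the sample variance is always at least as large as the population variance computed from the same numbers.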
Covariance Correlation
2. Inferential Statistics
2.1 Distributions
Normal distribution
Student’s T distribution
The Normal distribution is also known as Gaussian distribution or the Bell curve. It is
one of the most common distributions due to the following reasons:
• It approximates a wide variety of random variables
• Distributions of sample means with large enough sample sizes can be
approximated by a normal distribution
• All computable statistics are elegant
• Heavily used in regression analysis
• Good track record
N~(μ, σ²)
length of arms, legs, nails; blood pressure; thickness of tree barks, etc.
• IQ tests
• Stock market information
[Figure: three normal distributions with the same standard deviation and means μ = 470, μ = 743, and μ = 960]
Keeping the standard deviation constant, the graph of a normal distribution with:
• a smaller mean would look the same, but be situated to the left (in gray)
• a larger mean would look the same, but be situated to the right (in red)
[Figure: three normal distributions with the same mean μ = 743 and standard deviations σ = 70, σ = 140, and σ = 210]
N~(0,1)
Rationale of the formula for standardization:
We want to transform a random variable from N~(μ, σ²) to N~(0,1).
Subtracting the mean from all observations would cause a transformation from N~(μ, σ²)
to N~(0, σ²), moving the graph to the origin.
Subsequently, dividing all observations by the standard deviation would cause a
transformation from N~(0, σ²) to N~(0,1), standardizing the peak and the tails of the
graph.
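The two steps of standardization (subtract the mean, then divide by the standard deviation) can be checked numerically. The sketch below uses a small hypothetical dataset; it treats the data as a whole population, which is an assumption made only to keep the example simple.

```python
import statistics

# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(data)       # mean of the original data
sigma = statistics.pstdev(data)  # standard deviation of the original data

# Step 1: subtract the mean, which moves the center to 0.
centered = [x - mu for x in data]

# Step 2: divide by the standard deviation, which rescales the spread to 1.
standardized = [c / sigma for c in centered]

print(statistics.mean(centered), statistics.pstdev(centered))
print(statistics.mean(standardized), statistics.pstdev(standardized))
```

After the first step the mean is 0 but the standard deviation is unchanged; after the second step the standard deviation is 1 as well.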
The Central Limit Theorem (CLT) is one of the greatest statistical insights. It states
that, no matter the underlying distribution of the dataset, the sampling distribution
of the means approximates a normal distribution when the samples are large enough.
Moreover, the mean of the sampling distribution is equal to the mean of the original
distribution, and its variance is n times smaller, where n is the size of the samples.
The CLT applies whenever we have a sum or an average of many variables (e.g. the sum
of rolled numbers when rolling dice).
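The dice example lends itself to a quick simulation. The sketch below is an illustration under assumed settings (a fixed random seed, samples of 30 rolls, 5000 samples); it compares the mean and variance of the simulated sample means with the values for a fair six-sided die, whose mean is 3.5 and whose variance is 35/12.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 30              # size of each sample (number of dice rolls)
num_samples = 5000  # number of samples drawn

# Mean of each sample of n fair dice rolls.
sample_means = [
    sum(random.randint(1, 6) for _ in range(n)) / n
    for _ in range(num_samples)
]

die_mean = 3.5
die_variance = 35 / 12

mean_of_means = statistics.mean(sample_means)
variance_of_means = statistics.pvariance(sample_means)

print(mean_of_means, die_mean)
print(variance_of_means, die_variance / n)
```

The first pair of numbers should be close to each other, and so should the second pair, reflecting the claim that the variance of the sampling distribution is n times smaller.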
Estimators Estimates
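General formula: $[\text{point estimate} - ME,\ \text{point estimate} + ME]$,
where ME is the margin of error.
$ME = \text{reliability factor} \times \frac{\text{standard deviation}}{\sqrt{\text{sample size}}}$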
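As a worked illustration of the margin of error, the sketch below computes a 95% interval around a sample mean. All inputs are hypothetical, and it assumes the population standard deviation is known, so the reliability factor comes from the standard normal distribution (a z value) rather than from the t-distribution.

```python
import math
from statistics import NormalDist

# Hypothetical inputs, chosen only for illustration.
sample_mean = 100.0
sigma = 15.0       # population standard deviation, assumed known
n = 36             # sample size
confidence = 0.95

# Reliability factor: the z value that leaves alpha / 2 in each tail.
alpha = 1 - confidence
reliability_factor = NormalDist().inv_cdf(1 - alpha / 2)

# ME = reliability factor * standard deviation / sqrt(sample size)
margin_of_error = reliability_factor * sigma / math.sqrt(n)

interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print(reliability_factor, margin_of_error, interval)
```

A larger sample size shrinks the margin of error, while a higher confidence level widens it.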
[Figure: Student’s T distribution compared with the Normal distribution]
A random variable following the t-distribution is denoted $t_{\nu,\alpha}$, where $\nu$ is the
number of degrees of freedom.
We can obtain the Student’s T distribution for a variable with a Normally distributed
population using the formula: $t_{\nu,\alpha} = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
# populations   Population variance           Samples        Statistic
One             known                         -              z
One             unknown                       -              t
Two             -                             dependent      t
Two             known                         independent    z
Two             unknown, assumed equal        independent    t
Two             unknown, assumed different    independent    t
3. Hypothesis Testing
The ‘scientific method’ is a procedure that has characterized natural science since the
17th century. It consists in systematic observation, measurement, experiment, and
the formulation, testing and modification of hypotheses.
Since then we’ve evolved to the point where most people and especially
professionals realize that pure observation can be deceiving. Therefore, business
decisions are increasingly driven by data. That’s also the purpose of data science.
While we don’t ‘name’ the scientific method in the videos, that’s the underlying
idea. There are several steps you would follow to reach a data-driven decision
(pictured).
1. Formulate a hypothesis
2. Find the right test
3. Execute the test
4. Make a decision
3.2 Hypotheses
A hypothesis is “an idea that can be tested”.
It is a supposition or proposed explanation made on the basis of limited evidence as a
starting point for further investigation.
[Figure: distribution centered at 0 with an ‘accept’ region in the middle and a rejection region in each tail]
Graphically, the tails of the distribution show when we reject the null hypothesis
(‘rejection region’).
Everything that remains in the middle is the ‘acceptance region’.
The rationale is: if the observed statistic is too far away from 0 (depending on the
significance level), we reject the null. Otherwise, we accept it.
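The rationale can be sketched as a small Python function. It assumes a two-sided test whose standardized statistic follows the standard normal distribution under the null hypothesis; the function name `two_sided_decision` and the example statistics are hypothetical.

```python
from statistics import NormalDist


def two_sided_decision(z_stat, alpha=0.05):
    """Return 'reject' if z_stat falls in either tail, otherwise 'accept'."""
    # Critical value: the point that leaves alpha / 2 in each tail.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return "reject" if abs(z_stat) > critical else "accept"


print(two_sided_decision(2.5))              # far from 0 at 5% significance
print(two_sided_decision(1.0))              # close to 0
print(two_sided_decision(2.5, alpha=0.01))  # stricter significance level
```

Lowering the significance level moves the critical values further from 0, so the same statistic can fall in the rejection region at 5% and in the acceptance region at 1%.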
Accept
• At x% significance, we accept the null hypothesis
• At x% significance, A is not significantly different from B
• At x% significance, there is not enough statistical evidence that…
• At x% significance, we cannot reject the null hypothesis

Reject
• At x% significance, we reject the null hypothesis
• At x% significance, A is significantly different from B
• At x% significance, there is enough statistical evidence…
• At x% significance, we cannot say that *restate the null*
[Figure: two-sided test with a rejection region of α/2 = 0.025 in each tail and an ‘accept’ region in the middle; one-sided test with a single rejection region of α = 0.05]
In general, there are two types of errors we can make while testing: Type I error (False
positive) and Type II Error (False negative).
                      The truth
H0 (status quo)       H0 is true                       H0 is false
Accept                Correct decision                 Type II error (False negative)
Reject                Type I error (False positive)    Correct decision
The probability of committing a Type I error (False positive) is equal to the significance
level (α).
The probability of committing a Type II error (False negative) is denoted by beta (β).
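The link between the significance level and the Type I error rate can be illustrated with a simulation. The sketch below is an assumption-laden example: it draws many samples from a standard normal population for which the null hypothesis (mean 0) is true, runs a two-sided z-test with known standard deviation 1 on each, and counts how often the null is rejected. The seed, sample size, and number of trials are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

alpha = 0.05
critical = NormalDist().inv_cdf(1 - alpha / 2)

n = 25         # observations per sample
trials = 4000  # number of simulated tests

rejections = 0
for _ in range(trials):
    # The null hypothesis (population mean 0) is true for every sample.
    sample = [random.gauss(0, 1) for _ in range(n)]
    sample_mean = sum(sample) / n
    z_stat = sample_mean / (1 / n ** 0.5)  # standard error is sigma / sqrt(n), sigma = 1
    if abs(z_stat) > critical:
        rejections += 1  # rejecting a true null is a Type I error

type_one_rate = rejections / trials
print(type_one_rate)
```

The share of false rejections should land near α, up to simulation noise.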
If you want to find out more about statistical errors, just follow this link for an article
written by your instructor.
3.7 P-value
Should you need to calculate a p-value ‘manually’, we suggest using an online p-value
calculator, e.g. this one.
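For a test statistic that follows the standard normal distribution, a p-value can also be computed with a few lines of Python instead of an online calculator. This is a sketch under that assumption only; the function name `p_value_z` is hypothetical, and the one-sided branch assumes the statistic lies in the direction of the alternative hypothesis. It does not cover t-statistics.

```python
from statistics import NormalDist


def p_value_z(z_stat, two_sided=True):
    """p-value of a z statistic under the standard normal distribution."""
    # Probability of a value at least as far from 0 as z_stat, in one tail.
    tail = 1 - NormalDist().cdf(abs(z_stat))
    return 2 * tail if two_sided else tail


print(p_value_z(1.96))                    # two-sided test
print(p_value_z(1.645, two_sided=False))  # one-sided test
```

Both example statistics sit at the usual 5% critical values, so the p-values come out very close to 0.05.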
# populations   Population variance       Samples        Statistic
One             known                     -              z
One             unknown                   -              t
Two             known                     independent    z
Two             unknown, assumed equal    independent    t
Decision rule
There are several ways to phrase the decision rule and they all have the same
meaning.
Reject the null if: