Basics
This page describes some (but not yet all) basic terms and concepts in statistics and regression
analysis.
Random variable
A random variable (or stochastic variable) is a variable that can take a random value from a set of
possible values. Each value has an associated probability, which gives the likelihood of that value
occurring.
Probability distribution
Normal distribution
Expected value
The expected value or population mean (µ) is the average value we would expect to find from a
random variable if we repeated an experiment an infinite number of times. In theory, the result is
the same as the average or arithmetic mean value, i.e. the sum of all values divided by the number
of values, although it’s calculated a bit differently.
Definition
The expected value is the sum of all possible values for a random variable, each value multiplied by
its probability of occurrence.
E(X) = x1*p1 + x2*p2 + ... + xn*pn
Where x1, ..., xn are the possible values of X and p1, ..., pn are their probabilities of occurrence.
Example
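A quick illustration (a standard textbook one, not from the course data): for a fair six-sided die, each of the values 1-6 has probability 1/6, so
E(X) = 1*(1/6) + 2*(1/6) + 3*(1/6) + 4*(1/6) + 5*(1/6) + 6*(1/6) = 3.5
Note that 3.5 is not itself a possible outcome; the expected value is a long-run average.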
Variance
Variance is a measure of the spread of values in a random variable. The larger the variance, the
greater the spread of values. For example, the two numbers 0 and 40 have a larger variance than 10
and 30, because they are more spread apart. In general, zero variance means that the values are
identical.
Definition
The variance of a random variable is the expected value of the squared deviation from the mean:
Var(X) = E[(X − µ)^2]
where µ is the mean, which is the same as the expected value of X, i.e. µ = E(X)
Example (simplified)
Let’s say we have eight data points with the values 2, 4, 4, 4, 5, 5, 7, 9.
µ = (2+4+4+4+5+5+7+9) / 8 = 5
For each value, we take its deviation from the mean and square it:
(x_i − µ)^2
(2-5)^2 = 9
(4-5)^2 = 1
(4-5)^2 = 1
(4-5)^2 = 1
(5-5)^2 = 0
(5-5)^2 = 0
(7-5)^2 = 4
(9-5)^2 = 16
We then take the mean (expected value) of these squared deviations to get the variance:
Var(X) = (9+1+1+1+0+0+4+16) / 8 = 32 / 8 = 4
σ = sqrt(4) = 2
This means that the variance of X is the standard deviation (σ) of X squared: Var(X) = σ^2
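As a check, a minimal Stata sketch of the same calculation (the eight values are entered by hand; summarize reports the sample variance, which divides by n-1, so it is rescaled to the population variance used above):
* enter the eight example values by hand
clear
input x
2
4
4
4
5
5
7
9
end
summarize x
* r(Var) is the sample variance (divides by n-1); rescale to the population variance
display "population variance = " r(Var)*(r(N)-1)/r(N)
display "standard deviation  = " sqrt(r(Var)*(r(N)-1)/r(N))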
Standard deviation
The standard deviation (σ) is another way of expressing the variance, calculated as the square root
of the variance.
Definition
σ = sqrt(Var(x))
σ^2 = Var(x)
Covariance
Covariance is a measure of the relationship between two random variables: how much the two
variables vary together, and in what direction. The covariance is positive when the variables move in the
same direction, and negative when they move in opposite directions. Zero covariance means there is
no linear relationship between them.
Covariance is measured in the units of the variables, making it hard to compare across
variables. Correlation fixes this by standardizing the values, giving us a fixed range of -1 to 1.
Definition
Cov(x, y) = E[(x − µ_x)(y − µ_y)], where µ_x = E(x) and µ_y = E(y)
If we take the covariance of a variable with itself, this simply equals its variance:
Cov(x, x) = Var(x)
Correlation
Correlation is a measure of the relationship between two random variables: how much the two
variables vary together, and in what direction. It is the same idea as the covariance, except it uses a
standardized range of values between -1 and 1, while covariance is measured in the units of the
variables. A value of 1 means a perfect positive relationship, -1 a perfect negative relationship, and 0
means no linear relationship at all.
Note that correlation does not imply causation. Just because there is a statistical relationship
between two things does not mean that one causes the other, only that they seem to occur at
roughly the same time. This also holds for covariance.
Definition
Corr(x, y) = Cov(x, y) / sqrt(Var(x) * Var(y))
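A small Stata sketch on simulated (hypothetical) data, showing the covariance in the variables' own units and the unit-free correlation:
* simulate two related variables
clear
set seed 1
set obs 200
gen x = rnormal()
gen y = 0.8*x + rnormal()
correlate x y, covariance   // covariance matrix, in the units of x and y
correlate x y               // correlation matrix, unit-free, between -1 and 1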
Data types
- Cross-sectional data: Data on many subjects at a single point in time. The subjects could be
individuals, firms, countries, regions, or something else. Example: the income of households
in Sweden in 2009.
- Time series data: Data on a single subject over time. Examples: the daily profit & loss of a
specific company over time, or the inflation rate of a country over many years.
- Panel data: Data on many subjects over time. A mix between cross-sectional and time series
data. Panel data is said to be “multi-dimensional” while the others are “one-dimensional”.
Examples: the income of many households in Sweden over time, the daily profit of multiple
companies over time.
Exogenous - The variable is determined completely outside the model and does not depend on any of the
variables in the model (not even the residual). That is what we want for the explanatory variables.
Endogenous - The variable depends on at least one of the other variables in the model.
Logarithms:
log(y) and log(x) (both in logs): the coefficient is an elasticity, since d log(y) / d log(x) = (dy/y) / (dx/x).
log(y) and x in levels: when x increases by 1 unit, y increases by approximately (100*coef)%. For example,
a coefficient of 0.05 means roughly a 5% increase in y. This is an approximation, which becomes less
exact as the coefficient gets larger.
If y can only be 0 or 1, the expected value of y can be interpreted as the probability that y is equal to 1.
Therefore, a multiple linear regression model with a binary dependent variable is called the linear
probability model (LPM).
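A minimal Stata sketch of an LPM on simulated data (the data-generating process below is purely hypothetical; robust standard errors are used because the LPM is heteroskedastic by construction):
* LPM: OLS with a binary dependent variable
clear
set seed 2
set obs 500
gen x = rnormal()
gen y = runiform() < invlogit(-0.5 + x)   // hypothetical binary outcome
regress y x, vce(robust)                  // fitted values approximate P(y = 1 | x)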
Time-series data
Logarithmic form common to eliminate scale effects.
Dummy variables often used to identify an event or to isolate a shock. Also for capturing seasonality.
Index numbers (e.g. CPI) often used as independent variables.
Static model:
Static Phillips curve:
inf_t = B0 + B1*unem_t + u_t
Inflation and unemployment in the same year.
The difference from a cross-sectional model is that the subscript i is replaced with t. A static model only
estimates immediate effects on the dependent variable, i.e. effects that take place in the same year.
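A minimal Stata sketch of the static Phillips curve (assuming an annual dataset with variables named year, inf and unem, e.g. Wooldridge's PHILLIPS data, is already in memory):
tsset year
regress inf unem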
Shortcomings of distributed lag models: The higher the number of lags you use, the more data you lose,
because the first observations in the sample have no values for the lagged variables and drop out of the estimation.
Example:
How the interest rate at time t is impacted by inflation at times t, t-1 and t-2. After running the regression we
get the estimated equation:
int_t = 1.6 + 0.48*inf_t - 0.15*inf_t-1 + 0.32*inf_t-2
Impact propensity/multiplier:
Impact propensity is 0.48.
Long-run propensity/multiplier
Long-run propensity is 0.48-0.15+0.32 = 0.65
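A Stata sketch of estimating such a model with lag operators and getting the long-run propensity (the variable names year, irate and inf are assumed; the data must be tsset first):
tsset year
regress irate inf L.inf L2.inf
display "impact propensity = " _b[inf]
lincom inf + L.inf + L2.inf   // long-run propensity, with a standard error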
Trends:
Many economic time series display a trending behavior over time, which might be important to
incorporate in our model. Two series might seem related just because they follow the same trend.
Danger of ignoring trends: Omitted variable bias.
Example: The average growth rate in GDP per capita for Sweden during 1971-2012 is 1.7%. Hence, if we put
y_t = log(gdp_per_capita) in a linear trend model y_t = a0 + a1*t + u_t, we would expect a1 ≈ 0.017.
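A minimal Stata sketch of adding a linear time trend (variable names y, x and year are assumed; including t guards against finding a relationship that is only due to common trends):
sort year
gen t = _n          // linear time trend: 1, 2, 3, ...
regress y x t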
Seasonality:
For example: If we suspect seasonality each quarter:
y_t = B0 + y1*Q2_t + y2*Q3_t + y3*Q4_t + b1*x_t,1 + b2*x_t,2 + u_t
where Q2, Q3 and Q4 are dummy variables for quarters 2-4 (quarter 1 is the base category). If there is no
seasonality we would find that all y (the dummy coefficients) = 0, which can be tested with an F-test.
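A Stata sketch of that test (assuming variables y, x1, x2 and a quarter variable coded 1-4 are in memory):
regress y i.quarter x1 x2
testparm i.quarter        // F-test of H0: all seasonal dummy coefficients are zero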
Autocorrelation function (ACF):
ACF for lag one (i.e. one time unit back):
corr(r_t, r_t-1) = cov(r_t, r_t-1) / sqrt(var(r_t) * var(r_t-1)) = cov(r_t, r_t-1) / var(r_t) = ACF(1)
(the last step uses var(r_t) = var(r_t-1), which holds under stationarity)
The ACF should typically decrease at larger time gaps, i.e. ACF(1) is larger than ACF(2). If ACF(1) is small we
have less dependency between time periods t and t-1.
E.g.
ac rus, lags(12)
ac dyus, lags(12)
The grey area is the non-rejection area, where we cannot reject that there is no dependency between
time periods, i.e. there may be no dependency. Outside the area we reject the null of no dependency,
i.e. there is evidence of some kind of dependency.
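To get the ACF values numerically rather than as a plot (assuming the same tsset data and the rus series from the example above):
corrgram rus, lags(12)    // ACF, PACF and Ljung-Box Q statistics for each lag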
In financial markets, if markets are efficient, we have zero arbitrage, so returns should show (close to) zero predictability.
We can replace, for example, TS.3 with a weaker assumption (which one?).
For example, if we want to control for a demand shock but cannot isolate the demand shock itself,
we can use an instrumental variable instead.
Good instrument (Z):
1) Relevance: the instrument contains some information with predictive power for the endogenous variable:
corr(Z, lfare) ≠ 0
corr(bmktshr, lfare) > 0
2) Validity: the instrument is uncorrelated with the error term:
corr(Z, E) = 0
corr(bmktshr, E) = 0
Step 1) Predict the variable we want to replace using the new instrumental variable.
Step 2) Replace the variable with the predicted variable in the original OLS regression
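Done by hand in Stata (using the same airfare variables as in the example below; the coefficients match 2SLS, but the standard errors from the second regression are not the correct 2SLS standard errors):
* step 1: first stage - regress the endogenous variable on the instrument and the exogenous controls
regress lfare bmktshr ldist ldist2
predict lfare_hat, xb
* step 2: use the prediction in place of lfare in the original regression
regress lpassen lfare_hat ldist ldist2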
Stata example:
Perform steps 1 and 2 with one command, instrumenting lfare with bmktshr:
ivregress 2sls lpassen (lfare=bmktshr) ldist ldist2, first