
Basics

This page describes some (but not yet all) basic terms and concepts in statistics and regression
analysis.

Random variable

A random variable (or stochastic variable) is a variable that can take a random value from a set of
possible values. Each value has an associated probability, which determines how likely it is to occur.

The two most common types of random variables are:

- Discrete random variables: Can take only a countable number of distinct values.
Examples: a die (1, 2, 3, 4, 5 or 6), the number of children in a family.
- Continuous random variables: Can take any value on a continuous scale.
Examples: the interest rate, stock market indexes, household income.

Probability distribution

A probability distribution is a mathematical function describing the possible values of a random
variable and their associated probabilities. The most common type of probability distribution is the
normal distribution.

Normal distribution

The normal distribution (or bell curve) is the most common probability distribution. It has a distinct
shape which makes it easy to remember.

The central limit theorem (CLT) states that the sum (or average) of many independent random
variables tends toward a normal distribution, given a large enough sample size (i.e. as long as the
experiment is repeated enough times), regardless of the distribution of the individual variables.

Expected value (population mean)

The expected value or population mean (µ) is the average value we would expect to find from a
random variable if we repeated an experiment an infinite number of times. In theory, the result is
the same as the average or arithmetic mean value, i.e. the sum of all values divided by the number
of values, although it is calculated a bit differently.
Definition

The expected value is the sum of all possible values for a random variable, each value multiplied by
its probability of occurrence.

E(X) = x1*p1 + x2*p2 + ... + xn*pn
Where:

E(X) is the expected value of the random variable X.

x1 to xn represent all the possible values that X can take.

p1 to pn are the probabilities of each value occurring.

Example

The expected value for rolling a die is:

E(X) = 1*(1/6) + 2*(1/6) + 3*(1/6) + 4*(1/6) + 5*(1/6) + 6*(1/6) = 3.5

There is a 1/6 chance that the die lands on each side. If we roll the die an infinite number of times,
the average value should be 3.5.
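
As a quick check, this weighted sum can be computed directly in Stata (the package used for the examples later in these notes):

display 1*(1/6) + 2*(1/6) + 3*(1/6) + 4*(1/6) + 5*(1/6) + 6*(1/6)

which prints 3.5.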

Variance

Variance is a measure of the spread of values in a random variable. The larger the variance, the
greater the spread of values. For example, the two numbers 0 and 40 have a larger variance than 10
and 30, because they are more spread apart. In general, zero variance means that the values are
identical.

Definition

The variance of a random variable is the expected value of the squared deviation from the mean:

Var(X) = E[(X - µ)^2]


Where:

Var(X) is the variance of the random variable X

µ is the mean, which is the same as the expected value of X, i.e. µ = E(X)

Example (simplified)

Let's say we have eight data points with the values 2, 4, 4, 4, 5, 5, 7 and 9.

The mean (µ) of these values is:

µ = (2+4+4+4+5+5+7+9) / 8 = 5
For each value, we take its deviation from the mean and square it:

(x_i - µ)^2
(2-5)^2 = 9

(4-5)^2 = 1

(4-5)^2 = 1

(4-5)^2 = 1

(5-5)^2 = 0

(5-5)^2 = 0

(7-5)^2 = 4

(9-5)^2 = 16

We then take the mean (expected value) of these squared deviations to get the variance:

Var(X) = (9+1+1+1+0+0+4+16)/8 = 32/8 = 4

The standard deviation (σ) is the square root of the variance:

σ = sqrt(4) = 2

This means that the variance of X is the standard deviation (σ) of X squared: Var(X) = σ^2
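
A minimal sketch of the same calculation in Stata (note that summarize on its own reports the sample variance, which divides by n-1, so the population variance used above is computed by hand):

clear
input x
2
4
4
4
5
5
7
9
end
quietly summarize x
generate sqdev = (x - r(mean))^2     // squared deviation from the mean
quietly summarize sqdev
display "Var(X) = " r(mean)          // 4
display "sd(X) = " sqrt(r(mean))     // 2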

Standard deviation

The standard deviation (σ) is another way of expressing the variance, calculated as the square root
of the variance.

Definition

The square root of the variance:

σ = sqrt(Var(X))

or equivalently:

σ^2 = Var(X)

σ is the standard deviation of the random variable X

Var(X) is the variance of X

Covariance

Covariance is a measure of the relationship between two random variables: how much the two
variables vary together, and in what direction. The covariance is positive when the variables move in the
same direction, and negative when they move in opposite directions. Zero covariance means there is
no linear relationship between them.

Covariance is measured in the units of the variables themselves, making it hard to compare across
variables. Correlation fixes this by standardizing the values, giving us a fixed range of -1 to 1.

Definition

Cov(X, Y) = E[(X - E(X)) * (Y - E(Y))] = E(XY) - E(X) * E(Y)

If we take the covariance of a variable with itself, this simply equals its variance:

Cov(X, X) = Var(X)

Correlation

Correlation is a measure of the relationship between two random variables: how much the two
variables vary together, and in what direction. It is the same as the covariance, except it uses a
standardized range of values between -1 and 1, while covariance is measured in the same units as the
variables. A value of 1 means a perfect positive relationship, -1 a perfect negative relationship, and 0
means no relationship at all.

Note that correlation does not imply causation. Just because there is a statistical relationship
between two things does not mean that one causes the other, only that they seem to occur at
roughly the same time. This also holds for covariance.

Definition
Corr(X, Y) = Cov(X, Y) / sqrt(Var(X) * Var(Y))
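
In Stata, covariance and correlation can be inspected with the correlate command; a minimal sketch using the built-in auto dataset:

sysuse auto, clear
correlate price mpg, covariance    // covariance, in the variables' own units
correlate price mpg                // correlation, standardized to the range -1 to 1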

Data types

There are different types of data, which require different methods:

- Cross-sectional data: Data on many subjects at a certain point in time. The subjects could be
individuals, firms, countries, regions, or something else. Example: the income of households
in Sweden in 2009.
- Time series data: Data on a single subject over time. Examples: the daily profit & loss of a
specific company over time, or the inflation rate of a country over many years.
- Panel data: Data on many subjects over time. A mix between cross-sectional and time series
data. Panel data is said to be “multi-dimensional” while the others are “one-dimensional”.
Examples: the income of many households in Sweden over time, the daily profit of multiple
companies over time.
Exogenous - The variable is determined completely outside the model and does not depend on any of the variables
in the model (not even the residual). That's what we want.

Endogenous - The variable depends on at least one of the other variables in the model.

Logarithms:
log(y) on log(x) (log-log): the coefficient is an elasticity, since d log(y) / d log(x) = (dy/y) / (dx/x).
log(y) on x in levels (log-linear): when x increases by 1 unit, y increases by approximately (100 * coef) percent.
This is an approximation, which becomes less exact as the coefficient gets larger.

To get the exact percentage change:

100 * [exp(coef) - 1]
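
A minimal log-linear sketch in Stata, using the built-in nlsw88 dataset (the choice of wage and ttl_exp is just for illustration):

sysuse nlsw88, clear
generate lwage = ln(wage)
regress lwage ttl_exp
display "approximate % effect of one more year of experience: " 100 * _b[ttl_exp]
display "exact % effect: " 100 * (exp(_b[ttl_exp]) - 1)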

Limited Dependent Variable (LDV):

When the dependent variable (y) is a dummy/binary/boolean/qualitative variable (i.e. it can only
take the value 0 or 1).

It is often OK to use OLS even with an LDV.

If y can be 0 or 1, the expected value of y can be interpreted as the probability that y is equal to 1.
Therefore, a multiple linear regression model with a binary dependent variable is called the linear
probability model (LPM).
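
A minimal LPM sketch in Stata (again using the built-in nlsw88 dataset, where union is a 0/1 variable; robust standard errors are used since the LPM error term is heteroskedastic by construction):

sysuse nlsw88, clear
regress union grade ttl_exp, vce(robust)
predict p_union, xb    // fitted values, interpreted as P(union = 1)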

Time-series data
Logarithmic form is common, to eliminate scale effects.
Dummy variables are often used to identify an event or to isolate a shock, and also to capture seasonality.
Index numbers (e.g. CPI) are often used as independent variables.

Static model:
Static Phillips curve:
inf_t = B0 + B1 * unem_t + u_t
Inflation and unemployment in a given year.

The difference from a cross-sectional model is that the subscript i is replaced with t. A static model only
estimates immediate effects on the dependent variable, i.e. effects that take place in the same period.

Finite Distributed Lag Models (FDL)

y_t = B0 + B1 * z_t + B2 * z_t-1 + B3 * z_t-2 + u_t

This model states that y is affected by a change in z in period t, but also by changes in z that
happened earlier (at times t-1 and t-2).

Has a high risk of omitted variable bias.

Shortcomings: The more lags you use, the more data you lose, because the first observations in the
sample have no earlier values to serve as lags (with two lags, the first two time periods drop out of the regression).

Example:
How the interest rate at time t is impacted by inflation at times t, t-1 and t-2. After running the regression we
get the fitted equation:
int_t = 1.6 + 0.48*inf_t - 0.15*inf_t-1 + 0.32*inf_t-2

Impact propensity/multiplier:
The impact propensity is the immediate (same-period) effect, i.e. the coefficient on inf_t: 0.48.

Long-run propensity/multiplier:
The long-run propensity is the total effect once all lags have worked through, i.e. the sum of the
coefficients: 0.48 - 0.15 + 0.32 = 0.65.
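
A sketch of how such a model could be estimated in Stata, assuming a yearly dataset with hypothetical variables year, irate and inf (the lag operators L. and L2. require the data to be tsset first):

tsset year
regress irate inf L.inf L2.inf
* impact propensity: the coefficient on inf, i.e. _b[inf]
* long-run propensity: the sum of all inflation coefficients
lincom inf + L.inf + L2.inf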

Trends:
Many economic time series display a trending behavior over time, which might be important to
incorporate in our model. Two series might seem related just because they follow the same trend.
Danger of ignoring trends: Omitted variable bias.

Linear time trend:

y_t = a0 + a1*t + e_t

Example: The average growth rate in GDP per capita for Sweden during 1971-2012 is 1.7%. Hence, if
y_t = log(gdp_per_capita), then a1 = 0.017.
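
A minimal sketch of the trend regression in Stata (gdp_per_capita and year are hypothetical variable names):

tsset year
generate lgdp = ln(gdp_per_capita)
generate t = _n
regress lgdp t
* the coefficient on t is the average growth rate per period (about 0.017 in the example above)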

Seasonality:
For example, if we suspect seasonality each quarter we add quarterly dummy variables:
y_t = B0 + y1*Q2_t + y2*Q3_t + y3*Q4_t + B1*x_t1 + B2*x_t2 + u_t
If there is no seasonality we would find that all the y coefficients (on the quarter dummies) are 0, which can be tested with an F-test.
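
A sketch in Stata, assuming quarterly data with a quarterly date variable qdate and hypothetical regressors x1 and x2:

tsset qdate
generate q = quarter(dofq(qdate))    // quarter number 1-4
regress y i.q x1 x2
testparm i.q    // joint F-test that all seasonal dummies are zero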
Autocorrelation function (ACF):
ACF for lag one (i.e. one time unit back), using that var(r_t) = var(r_t-1) for a stationary series:
corr(r_t, r_t-1) = cov(r_t, r_t-1) / sqrt(var(r_t) * var(r_t-1)) = cov(r_t, r_t-1) / var(r_t) = ACF(1)

The sample ACF for lag s (r_bar is the sample mean of the series):

ACF(s) = [sum from t=s+1 to T of (r_t - r_bar) * (r_t-s - r_bar)] / [sum from t=1 to T of (r_t - r_bar)^2]

The ACF should decrease at larger time gaps, i.e. ACF(1) should be larger than ACF(2). If ACF(1) is small we have
little dependency between time periods t and t-1.

In Stata, this can be calculated automatically by:

ac dependent_var, lags(s)

E.g.
ac rus, lags(12)
ac dyus, lags(12)

The grey area is the non-rejection area, where we cannot reject that there is no dependency between
time periods, i.e. there is a chance there is no dependency. Spikes outside the area indicate that there is
some kind of statistically significant dependency.

In financial markets, if markets are efficient, we have zero arbitrage, so returns should have (close to) zero predictability.

Stationarity and Weak Dependence:

Time series observations can rarely if ever be assumed to be independent. This might imply that the
CLM assumptions do not hold. However, the OLS estimator can still have good large-sample properties.
If we assume that our data are stationary and weakly dependent, we can modify TS.1-3.

We can then replace, for example, TS.3 (strict exogeneity) with the weaker assumption of
contemporaneous exogeneity: the error term only needs to be uncorrelated with the explanatory
variables in the same time period.

Instrumental variables (IV):

When there is correlation between an independent/explanatory variable and the error term (i.e. MLR.4
does not hold), we can use instrumental variables: an indirect variable, related to the explanatory
variable, is used to control for this.

For example, if we want to control for a demand shock but cannot isolate the demand shock itself,
we can use an instrumental variable instead.

Good instrument:
1) Relevant: Contains some information that has predictive power for the explanatory variable:
corr(Z, lfare) ≠ 0
e.g. corr(bmktshr, lfare) > 0
2) Validity: Uncorrelated with the error term:
corr(Z, u) = 0
e.g. corr(bmktshr, u) = 0

Step 1) Predict the variable we want to replace, using the new instrumental variable (together with the other exogenous regressors).
Step 2) Replace the variable with the predicted variable in the original OLS regression.

Stata example:
Perform steps 1 and 2 with one command, instrumenting lfare with the instrumental variable bmktshr:
ivregress 2sls lpassen (lfare=bmktshr) ldist ldist2, first
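
The same thing written out as the two explicit steps (a sketch; the standard errors from the manual second stage are not the correct 2SLS standard errors, which is why ivregress is normally used):

regress lfare bmktshr ldist ldist2       // step 1: first stage
predict lfare_hat, xb                    // fitted values of lfare
regress lpassen lfare_hat ldist ldist2   // step 2: second stage using the fitted values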
