Lecture 1: Introduction and Key Concepts
Concepts
Econometric theory
Links mathematical, statistical and economic theory with data
Key concepts: estimation and causal inference
Machine learning
Designed to handle large datasets and complex relationships
Focuses on prediction accuracy
Key algorithms: neural networks, natural language processing models (NLP)
Example: image recognition
Textbooks
ISL: “An introduction to statistical learning: with applications in Python” James, G., Witten, D.,
Hastie, T., Tibshirani, R., & Taylor, J. Springer-Verlag. https://www.statlearning.com/
– easy to read, not very advanced on theory, examples of Python code for practice.
ESL: “The Elements of Statistical Learning” by Hastie, T., Tibshirani, R., & Friedman, J. H.
https://hastie.su.domains/ElemStatLearn/
– theory-oriented, suitable for students with a sound math background, who want to
understand the theory better.
Maths notation
We use upper case letters such as Y to denote random variables.
Lower case letters denote observed values. For example, y denotes the realised value of the
random variable Y.
We use i to index the observations, j to index the explanatory (independent) variables or
predictors, features or inputs. For example, yi is the dependent variable or response for
observation i, while xij is the value of predictor j for observation i.
We use the hat notation for estimates.
Vectors are in lower-case letters such as x. Matrices are in upper-case letters, X.
Key concepts
Setting
Noise could be: measurement error, other factors that could influence the model that are not
accounted for (i.e. model imprecision)
Inference
The question of which media is most effective is less about prediction and more about inference: applying the model to give advice.
This would require at least knowing the properties of $\hat{f}$.
Estimation/training
If x is bivariate, then…
Noise – we do not know where the noise enters the model, i.e. whether it is noise in Y or noise in x
o Application consequence: estimation will be accurate for this sample but may perform badly outside of the given sample
o See slide 22
Overfitting
Nonparametric Methods
Make no strong assumptions about the functional form of the data.
Flexibility increases as sample size grows.
Advantages: flexible – can model complex relationships.
Disadvantages: prone to overfitting, especially with small datasets; less interpretable and
computationally expensive with large datasets.
Flexibility vs interpretability
A main trade-off in econometrics – highly flexible, non-parametric methods tend to be less
interpretable than simpler methods
Conclusion from the above graph: as we increase model flexibility (e.g. the number of input variables), the training MSE falls.
Test MSE reaches its optimum at around 3.
Essentially, when the variance is large (e.g. due to overfitting), the bias tends to be small, and vice versa.
As a result, we use the test MSE to resolve this trade-off: the point where the test MSE is lowest is where the bias-variance trade-off is balanced (see in blue).
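For reference, a minimal statement of the decomposition behind this trade-off (expected test MSE at a point $x_0$, assuming additive noise with variance $\sigma^2$):
$E\big[(y_0 - \hat f(x_0))^2\big] = \mathrm{Var}(\hat f(x_0)) + \big[\mathrm{Bias}(\hat f(x_0))\big]^2 + \sigma^2$
The first two terms move in opposite directions as flexibility increases; the irreducible error $\sigma^2$ is a lower bound on the test MSE.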
Lecture 2: Linear regression and estimation techniques
Linear regression
We do this by trying to minimise the sum of squared vertical distances between the regression line and each data point.
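In symbols, for the simple one-predictor case (consistent with the notation above):
$\hat\beta_0, \hat\beta_1 = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$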
Advantages:
o Popular for a more complicated model
o Useful when you can easily simulate your data, but cannot create a formula
Recall that $\hat\beta$ for MLE (with normally distributed errors) is the same as $\hat\beta$ for OLS – so we do not need to derive the estimator again
Another source of bias is simultaneity, where the direction of causation is unclear,
e.g. budget and GDP – does the budget affect GDP or the other way around?
Gauss-Markov theorem
Recall SE($\hat\beta_j$) is the square root of the j-th diagonal entry of the variance-covariance matrix
The test statistic follows a t-distribution because we only have an estimate of the SE rather than its true value – this matters particularly for smaller data sets
The F statistic compares the fit of the restricted model implied by the null hypothesis with the model you are estimating
Sales advertising example
Tutorial 2 notes
General idea: there is a problem, we have some data relating to the problem (where the data is
typically a sample of a larger population), we do modelling to either learn something about the
relationship between variables (e.g. policy analysis) or to predict something
Sample splitting
Test the model on the 1st sample, 2nd sample, etc.; if it performs well on most of them, we can be reasonably confident that it is neither underfitting nor overfitting
Fit the model on one part, then evaluate its performance on the other part – a tool used to emulate how the model generalises across different samples
o We also need to make sure the split samples are representative of the data set – therefore, we should split at random rather than take observations systematically
Does not necessarily mean that the result gives you an optimal model
The strategy is not foolproof because there is a lot of randomness – i.e. we can get quite different results if we use a different seed
MSE plot
The validation (test) MSE increases after degree 10 – this suggests overfitting once the MSE starts rising
The drop in MSE from degree 1 to degree 2 suggests that degree 1 underfits
We want the validation error to be as small as possible – here that happens between degrees 2 and 10
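A minimal sketch of how such a plot could be produced (the simulated data and variable names are illustrative, not from the tutorial):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, 200)   # illustrative nonlinear data
x_tr, x_va, y_tr, y_va = train_test_split(x, y, test_size=0.3, random_state=1)

for degree in range(1, 15):
    X_tr = PolynomialFeatures(degree).fit_transform(x_tr)
    X_va = PolynomialFeatures(degree).fit_transform(x_va)
    model = LinearRegression().fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training MSE
          mean_squared_error(y_va, model.predict(X_va)))   # validation MSE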
Method of models
Loss functions
We want to minimise this loss function in order to find an appropriate model
Although quadratic loss is the most widely used, the others may be more useful depending
on the application
o Absolute loss for discrete variables
o Log-likelihood for forecasting
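For reference, the standard forms of these loss functions (with $y$ the outcome, $\hat y$ a point prediction and $\hat p$ a predicted probability):
Quadratic loss: $L(y, \hat y) = (y - \hat y)^2$
Absolute loss: $L(y, \hat y) = |y - \hat y|$
Log-likelihood (negative log) loss: $L(y, \hat p) = -\log \hat p(y)$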
Expected loss
In the ideal case of rich data, we can divide the data into two sets: a training set and a test set
• use the training set to fit/estimate the model and the test set to estimate the average loss
• pick the model with the smallest prediction error
But...
• data is precious
• we can do better with cross-validation
• or criteria based on loss which penalise complexity – e.g. adjusted R², AIC, BIC
Information criteria
Theoretical justification
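For reference, the usual definitions (with $\hat L$ the maximised likelihood, $k$ the number of parameters and $n$ the sample size):
$\mathrm{AIC} = 2k - 2\log \hat L, \qquad \mathrm{BIC} = k \log n - 2\log \hat L$
Both penalise complexity through $k$; BIC penalises more heavily once $\log n > 2$ (roughly $n > 7$).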
Cross-validation
This does not make any assumptions as to the distribution
K-fold
The selected model is the one with the smallest CV prediction error.
Typical choices of K are 5, 10 or n. The case K = n is known as leave-one-out cross-validation.
Cross-validation is simple and widely used. However, CV can be sometimes very
computationally expensive because one has to fit the model many times.
Benefit:
o Do not need to fit model as many times as LOOCV – only k times as opposed to n
times
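A minimal sketch of K-fold CV with scikit-learn (the estimator and the data X, y are placeholders):
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

# 5-fold CV; scoring is negative MSE, so flip the sign to get the CV prediction error
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring='neg_mean_squared_error')
cv_error = -scores.mean()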
Model/Variable selection
Best subset selection: search over all $2^p$ possible subsets of the p covariates to find the best subset. The criterion can be AIC, BIC, cross-validation, or any other proper model selection criterion.
The challenge is that $2^p$ grows very quickly, and the search is exhaustive – every subset must be fitted and compared before the best one is found
Regularisation: Ridge
Why can $(X'X)^{-1}$ be large?
o Recall we need the determinant of $X'X$ for the inverse; a near-zero determinant makes the inverse large
This is useful when you suspect you will have high variance (i.e. or if there is
multicollinearity)
In this example, t is the radius of the sphere
Effectively the blue circle makes the values of the parameters smaller
Useful when we want to keep many variables in the model (to prevent omitted variable bias) while shrinking their coefficients rather than dropping them
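For reference, the ridge problem and its closed-form solution (standard notation, matching the penalised set-up above):
$\hat\beta^{\text{ridge}} = \arg\min_\beta \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 = (X'X + \lambda I)^{-1} X'y$
Adding $\lambda I$ makes $X'X + \lambda I$ invertible even when $X'X$ is (nearly) singular, which is exactly the multicollinearity problem above.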
Tutorial 3
import pandas as pd
The pandas library is used for data manipulation in the following
Inspecting
We can check the first 5 rows of the dataframe marketing by using the head function
e.g. marketing.head()
To see the full information about a dataframe including number of entries and datatypes use
the info() function
e.g. marketing.info()
Dropping a column
to_drop = ['Age', 'Gender', 'Married', 'Children']
no_demographic_info = marketing.drop(columns = to_drop)
Pyplot library
import matplotlib.pyplot as plt
The numpy.random.randn() function creates an array of specified shape and fills it with random
values as per standard normal distribution.
Example:
my_figure3 = plt.figure()
n, bins, patches = plt.hist(amount_spent, rwidth=0.95, bins=10, align='mid', label='Number of Customers', color=[0, 0, 1, 1])
Linear regression
A regression model is a supervised learning model which predicts a continuous target variable.
There are two types:
1. Simple Linear Regression (SLR)
2. Multiple Linear Regression (MLR) (i.e. multiple features/predictors to explain target
variable/dependent variable)
SLR
To first determine which independent variables best predict the house price, do the following:
Scatter plot each independent variable against the dependent variable
Calculate the linear correlation between each independent variable and the dependent variable
o Shortfall: correlation scores only reflect the linear relationship between variables
o Example:
o correlations = br.corr()
o correlations['Price'] shows correlation of variables w.r.t Price
Inspecting the type of a Series and of its .values attribute
o e.g
#investigate the types of objects
print('type of y:', type(y))
print('type of y.values:', type(y.values)) # apply attribute values to Series
Statsmodels can take a formula as a string and select the named columns from the dataframe.
One benefit is that you can call .summary() to print summary statistics of the fitted linear regression model.
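A minimal sketch of the formula interface (the dataframe br and the column names Price and Rooms are assumed from the housing example and may differ):
import statsmodels.formula.api as smf

# fit an SLR of Price on Rooms using the string-formula interface
slr = smf.ols('Price ~ Rooms', data=br).fit()
print(slr.summary())   # coefficient estimates, standard errors, R-squared, etc.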
MLR
We use seaborn
import seaborn as sns
Note: if we just want to learn something about the model (inference), it is fine to include as many useful variables as possible.
However, if we want to predict, it would actually be counter-productive to include ALL variables
(including the non-significant ones) as this would increase your variance.
Mathematical problem
Q1
As all x’s are iid, they have the same likelihood function (or density)
If there were multiple solutions to the first-order condition, we would check which one is the maximum
Lecture 4: Regularisation and Going Beyond Linearity
Why use regularisation?
Multicollinearity raises the following issues
o Tractability – does $(X'X)^{-1}$ exist?
o Inflated coefficient estimates and inflated standard errors of the coefficients suggest the model is unstable
To select variables from a high dimensional data set (i.e. more variables than observations)
How do LASSO and Ridge work?
They penalise large coefficient magnitudes by adding penalty terms to the least-squares objective
Regularisation: Ridge
Regularisation: LASSO
Option to also set some predictors to shrink to 0 faster than others – known as group lasso
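For reference, the LASSO objective in standard notation ($\lambda \ge 0$ is the tuning parameter):
$\hat\beta^{\text{lasso}} = \arg\min_\beta \; \|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j|$
The absolute-value (L1) penalty is what allows LASSO to set some coefficients exactly to zero, whereas the squared (L2) ridge penalty only shrinks them.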
Beyond linearity
Polynomial regression
Dataset example: wage and age
Step functions
Adv:
Easy to work with. Effectively creates a series of dummy variables representing each group.
Disadv:
Choice of cutpoints or knots can be problematic.
There may be differences within region.
Also discrete jumps may not be realistic. For creating nonlinearities, smoother alternatives
are available.
Piecewise polynomials
Splines
Splines are maximally continuous piecewise polynomials – at each of the knots, we have the
maximum amount of continuity
Linear splines
Cubic splines
To get rid of the kink (the discontinuity in the first derivative) in linear splines, we use cubic splines, which are continuous up to the second derivative at the knots.
Why are splines useful?
Dimension reduction when interpreting a plot or when plotting
Adv: natural cubic splines reduce variance at the edge/boundary points (see polynomial vs natural cubic splines below too)
o They are constrained to be linear beyond the boundary knots
Knot placements
Strategies:
1. Divide sample into quantiles and place knots at each quantile
Smoothing splines introduce a penalty term on the roughness of the fit, which replaces the explicit choice of knots
Local regression
Adv
o Able to model each predictor within the model in the way that is most appropriate for that predictor
Other notes:
o Can fit a GAM simply using, e.g. natural splines
o Coefficients not that interesting; fitted functions are.
o Can mix terms - some linear, some nonlinear
o Can use smoothing splines or local regression as well
Tutorial 4
We want variables correlated with the response but not highly correlated with each other, to avoid multicollinearity
A limitation of using a heatmap for correlation is that it only shows linear relationships, so you can look at pairplot() for non-linear relationships
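A minimal sketch (the dataframe name df is a placeholder):
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.corr(), annot=True, cmap='coolwarm')   # linear correlations only
plt.show()
sns.pairplot(df)   # scatter plots of every pair, useful for spotting non-linear patterns
plt.show()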
Tutorial 5
Regularisation
Introducing bias to hopefully reduce variance – therefore it is useful if there is a model with
high variance
Useful function for splitting data into training/test sets
o from sklearn.model_selection import train_test_split
o train, test = train_test_split(data, test_size = 0.2,random_state=1)
Standardising the predictors
o Essentially getting the predictors onto the same scale – note how vastly different salary and children are
Calculate the training mean and standard deviation, then subtract the training mean from BOTH the training and test sets, and divide by the training standard deviation
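A minimal sketch using the training statistics only (train and test follow the split shown above and are assumed to contain the numeric predictors):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(train)        # learns training means and standard deviations
train_scaled = scaler.transform(train)      # standardise the training set
test_scaled = scaler.transform(test)        # apply the SAME training statistics to the test set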
LASSO
How much the coefficients shrink towards zero depends on α: the higher α is, the closer the coefficients are to zero
o If a coefficient is (close to) 0, this essentially removes the corresponding predictor
To choose α we can split into training and test sets or do cross-validation (CV)
o Note: the LassoCV function is the most convenient implementation, with built-in CV-based model selection for the tuning parameter α
Purpose of LASSO: go beyond linearity restriction to predict data; gives more flexibility to
predict
o Can predict a model not based on a particular trend (e.g. linear, etc)
Split into test and training sets to get a more accurate trend
If the coefficients of predictors go to zero, then this suggests those predictors should be
removed
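A minimal sketch (using the scaled training data from above; variable names are placeholders):
from sklearn.linear_model import LassoCV

# 5-fold CV over a grid of alpha values chosen automatically
lasso = LassoCV(cv=5, random_state=1).fit(X_train_scaled, y_train)
print(lasso.alpha_)   # selected tuning parameter
print(lasso.coef_)    # coefficients; some may be exactly zero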
Ridge
Difference from LASSO
o Never shrinks any coefficient exactly to 0
o Adv: suitable if we don't want LASSO's feature selection; generally has lower bias
ElasticNet
Does both LASSO and Ridge
o LASSO – L1 penalty
o Ridge – L2 penalty
o The penalty term is a weighted combination of the L1 and L2 penalties
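A minimal sketch of how this mixing is expressed in scikit-learn (l1_ratio controls the weight on the L1 part; the values here are illustrative):
from sklearn.linear_model import ElasticNet

# alpha is the overall penalty strength; l1_ratio=0.5 puts equal weight on L1 and L2
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_train_scaled, y_train)
print(enet.coef_)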
Splines
By default, LSQUnivariateSpline fits a cubic spline, but you can specify parameter k to change
this (default k=3).
Must sort data first – requirement of spline functions
Compared to linear regression, splines are much harder to interpret
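A minimal sketch with scipy (x and y are placeholder numpy arrays; the interior knot positions are illustrative):
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

order = np.argsort(x)                 # spline functions require x to be sorted
x_sorted, y_sorted = x[order], y[order]
knots = [25, 40, 60]                  # interior knots, must lie strictly inside the range of x
spline = LSQUnivariateSpline(x_sorted, y_sorted, knots, k=3)   # k=3 gives a cubic spline
y_hat = spline(x_sorted)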
GAMs
Adv: easy computation; can still visualise the relationship of each predictor with the response; the fitted components are simply added up for prediction
Disadv: doesn't allow for a complex joint relationship between x1, x2 and y – only complex relationships between x1 and y and between x2 and y separately
o Only exception is to create an interaction term and set it as a new function and do
GAM on this
L() – linear regression
F() – for categorical variables
S() – spline
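A minimal sketch assuming the pygam library (whose term constructors are lowercase s, f, l); the column indices and term choices are illustrative:
from pygam import LinearGAM, s, f, l

# spline on column 0, categorical factor on column 1, linear term on column 2
gam = LinearGAM(s(0) + f(1) + l(2)).fit(X_train, y_train)
gam.summary()
y_pred = gam.predict(X_test)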
Quiz
You will not be asked to write code
But to interpret output or given code
Assignment
Dummy variables –
o Have a fixed effect, or
o Interact term
Model use: prediction vs estimation
o Estimation – study of causation
LASSO result
o Remove ATM, downtown (both highly correlated to shops), and high
Lecture 5: Classification and statistical decision theory
Business decision making scenarios often first require a classification
Goal of classification:
o ‘assign an object to exactly one category or class from a set of possible categories,
based on a set of observed measurements or features associated with that object’
Output variable Y takes value in a discrete set with C categories
Input variables are a vector X of predictors X1, . . . , Xp
A classifier is a prediction rule denoted by G(x) that, based on characteristics X = x of an
object, assigns the object into one of the C categories/classes.
So, c = G(x) if the object with characteristic x is classified into class c.
Loss function
Represented by a C x C loss matrix L
o Diagonal entries give the loss for correctly classified observations (typically zero)
o Off-diagonal entries give the loss for misclassified observations
For a 0-1 Loss function,
o 0 will be on diagonal for correct classes and 1 on off-diagonals for incorrect classes
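For example, in the binary case the 0-1 loss matrix is simply
$L = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$
i.e. zero loss on the diagonal (correct classification) and a loss of 1 for either type of misclassification.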
Bayes classifier
The Bayes classifier is optimal under the 0-1 loss, i.e., it has a smaller prediction error than any other classifier (in other words, the smallest expected loss among all classifiers).
Decision boundary
We have two classes k and j and x is our decision boundary.
If an observation lies exactly on the boundary, we are indifferent as to which class it is assigned to
Bayes classifier: binary case
So far we have been assuming that the probabilities are the true probabilities – however, this is often not the case, so we need to estimate the probabilities.
Bayes classifier
We ideally want to select the class that maximises the conditional probability $p_c(x)$. Just as in regression, where we predict using $f(x)$ and must estimate it, we do the same here but for $p_c(x)$.
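In symbols, consistent with the notation above:
$G(x) = \arg\max_{c} \; p_c(x), \qquad p_c(x) = P(Y = c \mid X = x)$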
K-nearest neighbours classifier
Neighbours are determined by a distance metric (e.g. Euclidean distance), and K specifies how many of the nearest observations/neighbours are used
Adv:
o Easy to visualise for 2D
o Idea of this non-parametric method is to capture any complexities and non-linear
relationships
Disadv:
o Hard to visualise beyond 2D
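A minimal sketch with scikit-learn (the data names are placeholders; K = 5 is illustrative):
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)   # Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))            # classification accuracy on the test set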
Choosing the optimal K
Suppose the true boundary is the dotted line and the black line is the estimated boundary
Below when k = 1
The red boundary shows discrete jumps – no smoothness in the boundaries – suggesting observations are being overfitted
When k = 100
The boundary line may be smoother, but it becomes more biased (oversmoothing)
Note: the training error decreases as K decreases – with K = 1 we get perfect in-sample classification, but this would lead to overfitting
Logistic regression
We transform the output above so that it lies between 0 and 1 (i.e. normalising the output); this gives the probability that Y = 1 given x.
The function $p(y \mid \pi) = \pi^y (1-\pi)^{1-y}$ gives $\pi$ when y = 1 and $1 - \pi$ when y = 0. We write it this way so that it has the form of the probability mass function of the Bernoulli distribution.
Next, we find the log-likelihood
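For reference, with $\pi_i = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}$ the log-likelihood is
$\ell(\beta) = \sum_{i=1}^{n} \big[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \big]$
There is no closed-form maximiser, so $\hat\beta$ is found numerically (e.g. by Newton-Raphson / iteratively reweighted least squares).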
See Credit Card Default example on slide 26 for visualisation of classification – plots suggest balance
has a big impact on whether one defaults
If the errors $\epsilon_d$, $\epsilon_n$ were normally distributed, a different (probit-style) model is possible – but other distributions become difficult to handle when there are more dimensions or more complexity
When interpreting the coefficients β, we interpret differences of the coefficients rather than the coefficients themselves; it is these differences that determine whether there is default or not
Confusion matrix
A binary classification problem with two categories
Category 1: positive
Category 0: negative
Instead of just assigning binary numbers, we can use generalised decision rule below (i.e.
suppose we are really concerned about the cost of fraud)
Loss matrix
Loss-based optimal decision rule
Finding τ for a specific loss matrix
Here, the loss from a missed fraud is greater than the investigation loss (i.e. $L_{FN} > L_{FP}$), which is the situation we most want to avoid
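A sketch of the standard derivation of the threshold under such a loss matrix (with $p(x)$ the estimated probability of fraud): flagging a case costs $(1-p(x))\,L_{FP}$ in expectation, not flagging it costs $p(x)\,L_{FN}$, so we classify as positive when
$p(x)\, L_{FN} > (1 - p(x))\, L_{FP} \;\Longleftrightarrow\; p(x) > \tau = \frac{L_{FP}}{L_{FP} + L_{FN}}$
When $L_{FN} \gg L_{FP}$ this pushes $\tau$ well below 0.5, so more cases are flagged for investigation.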
We also want this to be close to 1 – but we may care more about sensitivity than specificity
and vice versa depending on the problem
Imbalanced classes
Recommended readings:
Introduction to Statistical Learning – chapter 10
Elements of Statistical Learning – chapter 11
Deep Learning by Goodfellow, Bengio and Courville
Introduction
Neural networks and deep neural networks (called deep learning) – e.g. applied to image
recognition, language processing
Representation learning
We want to predict response Y based on raw/original covariates X with linear regression modelling
Often, before doing regression modelling, some appropriate transformation of the covariates is needed: $Z_1 = \varphi_1(X), \ldots, Z_d = \varphi_d(X)$.
Then we model
$E(Y \mid X) = \beta_0 + \beta_1 Z_1 + \cdots + \beta_d Z_d$
Neural networks
Also called artificial neural network (ANN) is a computational model that is inspired by the network
of neurons in the human brain.
X – inputs
Z – neurons (or computational nodes)
Y – output
Neuron
A neuron collects inputs X, which are weighted (the weights reflect the importance of each input) to get the net input. We pass the net input through an activation function to produce the neuron's output.
Adv:
o Flexible and computationally efficient
For example, compared to higher degree polynomials: Evaluating polynomial
terms of high order can be computationally expensive and slow, particularly
when the degree of the polynomial increases. In contrast, many activation
functions are simple to compute making them faster and more efficient in
practice.
o Sparsity
Activation function determines whether a neuron “fires” or remains inactive.
E.g. In ReLU, nodes with negative net inputs will output zero, effectively
turning them off and introducing sparsity into the network. This can be
advantageous for computational efficiency and reducing overfitting.
Forward propagation
where l indexes the hidden layers and $h^{(l)}$ is the activation function applied to the net input of layer l
Note: $h^{(L+1)}$ is the output activation function, which may or may not be linear
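A minimal sketch of forward propagation for one hidden layer with ReLU (the weights, biases and dimensions are illustrative placeholders):
import numpy as np

def relu(s):
    return np.maximum(0, s)           # activation: negative net inputs become 0 (sparsity)

def forward(x, W1, b1, W2, b2):
    s1 = W1 @ x + b1                  # net input of the hidden layer
    z1 = relu(s1)                     # hidden-layer activations (the Z's)
    y_hat = W2 @ z1 + b2              # linear output activation, as in regression
    return y_hat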
Neural net for regression
Example
Challenges in training?
Issues:
There is a huge number of parameters – since the weight vector w is normally very large
The surface of the loss function is often non-smooth and multimodal
Computationally expensive
In most cases, neural net models are trained by the Stochastic Gradient Descent (SGD) method.
Suppose we start near an optimum: the slope will be close to 0, so the update term below will be small.
The learning rate $a_t$ is a hyperparameter for neural network training
How to determine the optimum number of iterations, which also affects how fast we
approach optimum?
NOTE: ‘Mathematically, if we have a cost function J(θ), where θ represents the parameters of our
model, GD updates these parameters in the opposite direction of the gradient of the function at the
current point:’
To reduce the cost of each update, SGD uses mini-batches of size m instead of the whole set of n observations.
There is a trade-off – cheaper (but noisier) steps vs a more accurate gradient – and this trade-off determines how large m should be.
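A minimal sketch of one epoch of mini-batch SGD for a generic parameter vector theta (grad_fn and the data are placeholders for whatever loss is being minimised):
import numpy as np

def sgd_epoch(theta, X, y, grad_fn, lr=0.01, m=32, rng=np.random.default_rng(0)):
    idx = rng.permutation(len(y))                   # shuffle observations each epoch
    for start in range(0, len(y), m):
        batch = idx[start:start + m]                # mini-batch of size m
        g = grad_fn(theta, X[batch], y[batch])      # gradient estimated on the batch only
        theta = theta - lr * g                      # update: theta <- theta - a_t * gradient
    return theta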
Computing the gradient
Back propagation
We use the chain rule to compute the effect of each weight on our output, Y.
If the gradient varies a lot with steps, this suggests that we need to reduce the steps to improve
stability.
ADAM (adaptive moment estimation) keeps running estimates of the average gradient and of its (uncentred) variance, which is used to scale the gradient; it is widely used now
It tracks these averages of the gradient over time (exponential moving averages)
Regularization in deep learning