
Lecture 1: Introduction and Key Concepts

Econometric theory
 Links mathematical, statistical and economic theory with data
 Key concepts: estimation and causal inference

Machine learning
 Designed to handle large datasets and complex relationships
 Focuses on prediction accuracy
 Key algorithms: neural networks, natural language processing models (NLP)
 Example: image recognition

Intersection between econometrics and machine learning


 Modelling complex relationships: ML techniques can uncover non-linear patterns.
 Regularization in econometrics: Methods like LASSO improve model selection and reduce
overfitting.
 Causal Machine Learning: combining causal inference with high-dimensional ML models –
e.g. being able to explain to your client why a particular model was used.

Textbooks
 ISL: “An Introduction to Statistical Learning: with Applications in Python” by James, G., Witten, D.,
Hastie, T., Tibshirani, R., & Taylor, J. Springer. https://www.statlearning.com/
– easy to read, not very advanced on theory, with Python code examples for practice.
 ESL: “The Elements of Statistical Learning” by Hastie, T., Tibshirani, R., & Friedman, J. H.
https://hastie.su.domains/ElemStatLearn/
– theory-oriented, suitable for students with a sound maths background who want to
understand the theory better.

Maths notation
 We use upper case letters such as Y to denote random variables.
 Lower case letters denote observed values. For example, y denotes the realised value of the
random variable Y.
 We use i to index the observations, j to index the explanatory (independent) variables or
predictors, features or inputs. For example, yi is the dependent variable or response for
observation i, while xij is the value of predictor j for observation i.
 We use the hat notation for estimates.
 Vectors are in lower-case letters such as x. Matrices are in upper-case letters, X.

Key concepts

Supervised vs unsupervised learning


 Supervised (most of our analysis) – the data include a response variable to predict.

 Unsupervised – no response variable; more creative, but less used.

o Example: PCA – principal component analysis (also used in econometrics)

Setting

Noise could be: measurement error, other factors that could influence the model that are not
accounted for (i.e. model imprecision)

Inference
 The question of which advertising medium is most effective is less about prediction and more
about inference: applying the estimated model to give advice.
 This requires knowing at least the properties of f̂.

Estimation/training

The simplest way to approximate f is a Taylor series expansion around 0.

If x is bivariate, then…

 The interaction term is the highlighted cross term x₁x₂ (its coefficient is a constant).


 Challenge: the number of terms increases rapidly as the number of predictors increases.

 Noise – we do not know where the noise enters, i.e. whether it is noise in Y or noise in x.
o Practical consequence: the estimate may be accurate for this sample but perform
badly outside of the given sample.
o See slide 22

Overfitting

How to estimate f(): parametric vs nonparametric methods


Parametric Methods
 Assume a specific functional form for the relationship (e.g., linear, quadratic).
 Estimation is based on a fixed number of parameters.
 Advantages: efficient with small data sets (few parameters to estimate); easy to interpret
and computationally inexpensive.
 Disadvantages: inflexible – incorrect model may lead to bias (e.g. linear model for nonlinear
relation).

Nonparametric Methods
 Make no strong assumptions about the functional form of the data.
 Flexibility increases as sample size grows.
 Advantages: flexible – can model complex relationships.
 Disadvantages: prone to overfitting, especially with small datasets; less interpretable and
computationally expensive with large datasets.

Flexibility vs interpretability
A main trade-off in econometrics – highly flexible, non-parametric methods tend to be less
interpretable than simpler methods

Assessing model accuracy: MSE

 Recall y_i is the actual (observed) value.
 What is the difference between ŷ_i and f̂(x_i)?
 Training (or observation) data – in-sample – used for estimation.
 Test data – out-of-sample – not used for estimation; one way to guard against overfitting.

MSE: training vs test data


 Red line – based on test MSE
 Grey line – based on training MSE
 Note: the more parameters we include, the more flexible the model, but the poorer its
out-of-sample properties.

Conclusion from the graph: as we increase the number of input variables, the training MSE keeps falling.
 The test MSE reaches its optimum at a flexibility of around 3 – a sketch of this pattern follows below.
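A minimal sketch of this pattern on simulated data (the data-generating process, seed and degree range are invented for illustration, not taken from the lecture):

# Sketch: training vs test MSE as polynomial degree (flexibility) increases.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 200)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(200)     # true f plus noise

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

for degree in range(1, 11):
    poly = PolynomialFeatures(degree)
    Xtr = poly.fit_transform(x_train.reshape(-1, 1))
    Xte = poly.transform(x_test.reshape(-1, 1))
    model = LinearRegression().fit(Xtr, y_train)
    mse_train = mean_squared_error(y_train, model.predict(Xtr))
    mse_test = mean_squared_error(y_test, model.predict(Xte))
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")

The training MSE falls monotonically with the degree, while the test MSE eventually turns back up.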

Bias vs variance trade-off

Var(ε) – the irreducible error; it cannot be reduced by choosing a different model.

Bias² – tells us how badly f̂ approximates the true f.
 To reduce the bias, we can increase the number of parameters (or the order of the polynomial),
as per the Taylor series expansion.
Var_train – the variance of f̂ across different training samples.
 We want f̂ not to depend on the particular sample, so we want this variance to be small.

Essentially, a more flexible model has lower bias but higher variance (overfitting), and vice versa.
 As a result, we use the expected MSE to resolve this trade-off: the flexibility at which the MSE is
lowest is where the bias-variance trade-off is balanced (shown in blue).
Lecture 2: Linear regression and estimation techniques
Linear regression

Recall advertising data-set example

Simple linear regression


Y – sales
X – advertising

 We do this by trying to minimise the (squared vertical) distance between the regression line and
each data point.

Ordinary least squares


Goal: find β̂₀ and β̂₁ such that the RSS is minimised.
 RSS is the residual sum of squares, RSS = Σ_i ê_i² = Σ_i (y_i − β̂₀ − β̂₁x_i)².
 If a closed-form minimiser were not available, we could follow the gradient of the RSS
downhill towards the minimum (the idea behind gradient descent).

Multiple linear regression


Now, we have a regression plane (no longer a line).
Matrix notation
There are:
 n – observations
 p – predictors
 p+1 – parameters
o Due to the extra intercept parameter β₀

Note: for now, think of all these as column vectors

Recall rules of matrix algebra


Least squares estimators

Residuals and errors in linear regression


Method of Moments (MM) Estimator
 Instead of minimising a criterion, match sample moments to population moments
(e.g. mean, variance, covariance).
 Sometimes preferred over OLS because it avoids heavy matrix calculations.

 Advantages:
o Popular for more complicated models
o Useful when you can easily simulate data from the model but cannot write down a formula
(e.g. for the likelihood)

Maximum likelihood Estimator


Advantages
 Has the property of consistency – as the number of observations increases, the estimator gets
closer to the true parameter value.
Disadvantages
 Can be biased in finite samples (e.g. the MLE of the error variance).
Other properties

 Recall that β̂ under MLE (with normal errors) is the same as β̂ under OLS – so we do not need to
derive its properties again.

Unbiasedness property of the estimator


Omitted variable bias: ice cream sales and weight
Omitted variable bias is one source of endogeneity.

Another source of endogeneity is selection bias.


 e.g. only selecting weight of persons who work, excludes those who don’t work

Another source of bias is simultaneity, where the direction of causation is unclear.
 e.g. budget and GDP – does the budget affect GDP, or the other way around?

Variance property of the β̂ estimator – the variance-covariance matrix

Gauss-Markov theorem
 Recall SE(β̂_j) is the square root of the corresponding diagonal element of the variance-covariance matrix.
 The test statistic follows a t-distribution because we only have an estimate of the SE, not its true
value – this matters mostly for smaller data sets.

Standard goodness of fit measures

 F statistic – tests whether the model you are estimating explains significantly more than the
null (intercept-only) model.
Sales advertising example

Tutorial 2 notes
General idea: there is a problem, we have some data relating to the problem (where the data is
typically a sample of a larger population), we do modelling to either learn something about the
relationship between variables (e.g. policy analysis) or to predict something

Monte Carlo simulation


 Creating a data set ourselves (not yet modelling)
 Analyse relationship between labour and factory output
Object of statistical modelling: getting as close as possible to the true model using only the observed
data, without ever knowing what the true model is. A sketch of such a simulation follows.
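A minimal sketch of this kind of Monte Carlo data generation, assuming invented coefficients and noise level rather than the tutorial's actual values:

# Sketch: simulate a labour/output dataset from a known "true" model plus noise.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100
labour = rng.uniform(10, 100, n)                       # hours of labour
true_output = 5 + 0.8 * labour - 0.003 * labour**2     # the (normally unknown) true model
output = true_output + rng.normal(0, 4, n)             # add noise

sim_data = pd.DataFrame({"labour": labour, "output": output})
print(sim_data.head())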

What degree is the best model to choose?


 A degree of 20 gives the smallest training MSE – but we could always find a smaller training MSE by
making the model more complicated.
o However, the goal is never to fit a model that perfectly fits the sample data, because we
already know the sample data. We want a model that approximates the blue line,
not the red points.
o It is overfitting.
 A degree-1 model is underfitting.
 We want a model that captures the trend in the blue line, but not noise

Sample splitting
 Test on a 1st sample, a 2nd sample, etc.; if the model performs well on most of them, we can
be reasonably confident that it is neither underfitting nor overfitting.
 Fit model on one part, then evaluate the performance of the model on the other part – tool
used to emulate generalising a model across different samples
o We also need to make sure the split samples are representative of the data set –
therefore, we cannot take observations systematically
 Does not necessarily mean that the result gives you an optimal model
 The strategy is not foolproof because there is a lot of randomness – i.e. we can get completely
different results if we use a different seed.

MSE plot
 The validation (test) MSE increases after degree 10 – an increasing validation error suggests
overfitting.
 There is underfitting at degree 1, indicated by the sharp drop in MSE from degree 1 to degree 2.
 We want the validation error to be as small as possible – here, somewhere between degrees 2 and 10.

Bias and variance


 Bias – source of error where model is systematically underestimating or overestimating
o Decreases with complexity
 Variance – error associated from the randomness in sampling
o Increases as model becomes more complex
 The idea is that we repeat the sampling (say 1000 times) to see how the fitted model varies across samples.
Lecture 3: Model Selection and Regularisation
Model selection

Method of models
Loss functions
 We want to minimise this loss function in order to find an appropriate model
 Although quadratic loss is the most widely used, the others may be more useful depending
on the application
o Absolute loss for discrete variables
o Log-likelihood for forecasting

Expected loss

Bias and variance decomposition


 Recall that E(f̂(x₀)) is a constant, so there is no extra E² term above.

Train vs test expected loss

In the ideal case of rich-data, we can divide the data into two sets: training set and test set
• use the training set to fit/estimate the model and the test set to estimate the average loss.
• Pick up the model with the smallest prediction error
But...
• data is precious
• we can do better with cross-validation
• or criteria based on loss which penalise complexity – e.g. adjusted R², AIC, BIC

Information criteria

AIC - Akaike’s information criterion


Criterion: select the model with the smallest AIC.
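A hedged sketch of applying the criterion with statsmodels; the data frame and candidate formulas below are simulated placeholders, and res.aic / res.bic are the fitted criteria:

# Sketch: compare candidate models by AIC (smaller is better) on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1 + 2 * df["x1"] + rng.normal(size=200)      # x2 is irrelevant by construction

candidates = ["y ~ x1", "y ~ x2", "y ~ x1 + x2"]
fits = {f: smf.ols(formula=f, data=df).fit() for f in candidates}
for formula, res in fits.items():
    print(f"{formula:12s} AIC = {res.aic:8.2f}  BIC = {res.bic:8.2f}")
print("Selected by AIC:", min(fits, key=lambda f: fits[f].aic))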

Theoretical justification

 Consider data (y, X), and a model m depending on a vector of parameters β of size p.
 Under model m, the density of the data is p(y | β), i.e. the likelihood function.
 Denote by g(y) the true density function of the data. Of course g(y) is unknown!
 Idea: we want a model such that some discrepancy between the truth g(y) and the model-based
p(y | β) is small.

Let’s use Kullback-Leibler divergence to measure such discrepancy


 Kullback-Leibler (KL) divergence is widely used to measure the discrepancy between two
probability distributions.
 Consider two probability distributions with the corresponding density functions g(x) and p(x).
 Useful aspect of AIC
o Because it is based on the log-likelihood, it can be used with any assumed distribution,
not only the normal model.

BIC - Bayesian information criterion

Cross-validation
 This does not make any assumptions as to the distribution

Leave one out cross-validation


 Advantage:
o Do not lose much data – only 1 observation is left out at each step
o Have n evaluations – as opposed to a single evaluation on, say, a 20% test set
 Disadvantage
o Have to repeat the fit n times
o Each evaluation uses only 1 observation, which may not be representative of the data
 To combat this, we use K-fold CV, which holds out a fold (a block of observations) at a time

K-fold

 The selected model is the one with the smallest CV prediction error.
 Typical choices of K are 5, 10 or n. The case K = n is known as leave-one-out cross-validation.
 Cross-validation is simple and widely used. However, CV can be sometimes very
computationally expensive because one has to fit the model many times.
 Benefit:
o Do not need to fit the model as many times as LOOCV – only K times as opposed to n times
(a minimal sketch follows)
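A minimal K-fold sketch with scikit-learn, using simulated data and an arbitrary choice of K = 5:

# Sketch: 5-fold cross-validation of a linear regression on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=cv)
print("CV prediction error (MSE):", -neg_mse.mean())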
Model/Variable selection

Variable selection in linear regression


 Consider a linear regression model or a logistic regression model with p potential covariates.
 At the initial step of modelling, a large number p of covariates is often introduced in order
to reduce potential bias.
 The task is then to select the best subset among these p variables

Best subset selection: search over all 2^p possible subsets of the p covariates to find the best subset.
The criterion can be AIC, BIC, cross-validation or any other proper model selection criterion.
 The challenge is that 2^p becomes very large very quickly, and the search is exhaustive: every
subset has to be fitted and compared before the best one is found.

Example: credit dataset


Variable selection based on hypothesis testing

 Not very robust

Regularisation: Ridge

Problem: the variance of β̂ is too large.

 Why is (X′X)⁻¹ large?
o Recall that the inverse involves the determinant of X′X, which is close to zero under
multicollinearity.
 Ridge is useful when you suspect high variance (e.g. when there is multicollinearity).
 In this example, t is the radius of the sphere (the blue circle).
 Effectively, the blue circle constrains the parameter values to be smaller.

Note: we do not penalise β₀ in this criterion because it essentially captures the mean of y.


 λ is inversely related to t – as λ increases, the radius of the blue circle decreases.

 Useful when we keep many variables (to prevent omitted variable bias) but want to control the
variance this causes.

Special case when X variables are all uncorrelated


 We cannot simply count p parameters for AIC/BIC here, so we use the effective degrees of freedom
df(λ) instead of p.
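A minimal ridge sketch with scikit-learn on simulated, deliberately collinear predictors; note that sklearn calls the penalty weight alpha rather than λ:

# Sketch: ridge regression shrinks coefficients when predictors are highly correlated.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)             # near-collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + rng.normal(scale=0.5, size=100)

print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)
print("Ridge coefficients:", Ridge(alpha=10.0).fit(X, y).coef_)   # larger alpha => more shrinkage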

Tutorial 3
import pandas as pd
 pandas is the library used for the following examples and for data manipulation.

Read file into Python


e.g. marketing = pd.read_excel('DirectMarketing.xlsx')

Inspecting
 We can check the first 5 rows of the dataframe marketing by using the head function
e.g. marketing.head()

 To see the full information about a dataframe including number of entries and datatypes use
the info() function
e.g. marketing.info()

 To view the summary statistics of a dataframe, use the describe function


e.g. marketing.describe()

Indexing and Slicing


To get a subset, use the iloc indexer with the syntax iloc[rows, columns]
 Note: index starts at 0
 e.g. Extracting first 3 rows of DM
o first_three = marketing.iloc[0:3, :]
o first_three.head()
 e.g. Subset
o subset = marketing.iloc[0:3, 5:10]
o subset.head()
 Extracting a column from a data frame
o e.g. catalogs = marketing['Catalogs']

Dropping a column
to_drop = ['Age', 'Gender', 'Married', 'Children']
no_demographic_info = marketing.drop(columns = to_drop)

Pyplot library
import matplotlib.pyplot as plt

The numpy.random.randn() function creates an array of specified shape and fills it with random
values as per standard normal distribution.

Line graph plt.plot()


Histogram – plt.hist() – for a numerical data column

Example:
my_figure3 = plt.figure()
n, bins, patches = plt.hist(amount_spent, rwidth = 0.95, bins = 10, align =
"mid", label = 'Number of Customers', color = [0, 0, 1, 1])

# Colour is specified as Red, Green, Blue, Opacity


plt.xticks(bins)
plt.xlabel('Amount Spent')
plt.ylabel('Number of Customers')
plt.title('Histogram of Amount Spent')
plt.legend()

Linear regression
A regression model is a supervised learning model which predicts target variable that is continuous.
There are two types:
1. Simple Linear Regression (SLR)
2. Multiple Linear Regression (MLR) (i.e. multiple features/predictors to explain target
variable/dependent variable)

SLR
To first determine which independent variables best predict the house price, do the following:
 Scatter plot each independent variable against the dependent variable
 Calculate the linear correlation between each independent variable and the dependent variable
o Shortfall: correlation scores only reflect the linear correlation between variables
o Example:
o correlations = br.corr()
o correlations['Price']  # shows correlation of each variable with Price
 Accessing the underlying array of a Series via the .values attribute
o e.g
 #investigate the types of objects
 print('type of y:', type(y))
 print('type of y.values:', type(y.values)) # apply attribute values to Series

Estimating coefficients of SLRs

Defining model using statsmodel package

Statsmodels will pass a string as a formula and select the named columns from the dataframe.
One benefit is that you can call .summary() to print summary statistics of the fitted linear regression
model.

import statsmodels.formula.api as smf


If there is a heteroskedastic pattern in the plot (i.e. the variance increases with x), then we can use robust
standard errors.
 We use the cov_type option
 Example:
o robust_slr = slr.get_robustcov_results(cov_type='HC0')
o robust_slr.summary()

MLR
We use seaborn
import seaborn as sns

Gives plot for each predictor with price


sns.pairplot(br)
plt.show()

Adding multiple predictors for MLR

Estimating MLR coefficients using explicit analytic formula
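A sketch of the analytic formula β̂ = (X′X)⁻¹X′y in numpy; the data are simulated here, so treat the numbers as placeholders for the tutorial's house-price variables:

# Sketch: MLR coefficients via the analytic formula beta_hat = (X'X)^(-1) X'y.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n),                  # intercept column
                     rng.normal(size=n),          # predictor 1
                     rng.normal(size=n)])         # predictor 2
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y       # (X'X)^(-1) X'y; np.linalg.solve is numerically safer
print(beta_hat)                                   # close to beta_true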

Note: if we just want to learn something about the model, it is fine to include as many useful variables as
possible.
However, if we want to predict, it can be counter-productive to include ALL variables
(including the non-significant ones), as this increases the variance.

Mathematical problem
Q1
 As all the x's are iid, they have the same likelihood function (or density)
 If there were multiple candidate solutions, we would check which one is the maximum
Lecture 4: Regularisation and Going Beyond Linearity
Why use regularisation?
 Multicollinearity raises the following issues
o Tractability – does (X′X)⁻¹ exist?
o Inflated coefficient estimates and standard errors in linear regression – suggests the
model is unstable
 To select variables from a high dimensional data set (i.e. more variables than observations)
How LASSO and Ridge work?
 Penalises large coefficient magnitudes by using penalty terms

Elastic net is a weighted combination of penalty terms of Lasso and ridge


Regularisation: Ridge

Regularisation: LASSO

Previously, for ridge, the penalty was based on β² (the L2 norm); for LASSO it is based on |β| (the L1 norm).


Lasso vs ridge
 Ridge
o Property that reduces MSE
 Lasso
o Has oracle property to select the right variables
 Compromise property – elastic net

 There is also the option to make some (groups of) predictors shrink to 0 faster than others – known as the group lasso

Beyond linearity

Polynomial regression
Dataset example: wage and age

Step functions
Adv:
 Easy to work with. Effectively creates a series of dummy variables representing each group.
Disadv:
 Choice of cutpoints or knots can be problematic.
 There may be differences within region.
 Also discrete jumps may not be realistic. For creating nonlinearities, smoother alternatives
are available.

Piecewise polynomials

 Combines polynomials and step functions

Splines
Splines are maximally continuous piecewise polynomials – at each of the knots, we have the
maximum amount of continuity

Linear splines
Cubic splines

To get rid of the kink (the discontinuity in the first derivative) at the knots of linear splines, we use cubic splines, which are continuous up to the second derivative.
Why are splines useful?
 Dimension reduction when interpreting a plot or when plotting

Natural cubic splines

 Adv: removes variance at edge points (see polynomial vs natural cubic splines below too)
o Linear beyond boundary knots

Knot placements

Strategies:
1. Divide sample into quantiles and place knots at each quantile

Note: Knot parameters


 cubic spline with K knots has K + 4 parameters or degrees of freedom
 natural spline with K knots has K degrees of freedom.
o Note: if there are K parameters, there should be K-1 knots
Smoothing splines

 Introduces a penalty term to improve the smoothness of a fit, replaces the knot choice

Local regression

 Adv: no kinks, reasonably smooth function


 Disadv: if the bandwidth h is too wide, you get a smoother line (i.e. low variance) but it deviates from
the true line (i.e. bias) – again a bias-variance trade-off
o Choosing h may be difficult
 Acts like a weighted OLS fitted locally around each point

Generalised additive models


Incorporates different types of functional forms at different points of the regression

 Adv
o Able to model each predictor with the functional form that is most appropriate for that
predictor
 Other notes:
o Can fit a GAM simply using, e.g. natural splines
o Coefficients not that interesting; fitted functions are.
o Can mix terms - some linear, some nonlinear
o Can use smoothing splines or local regression as well

Tutorial 4
We want variables correlated with response but not highly correlated with each other to avoid
multicollinearity
 Limitations of using heatmap for correlation is that it only shows you linear relationships so
you can look at pairplot() for non-linear relationships

Best subset selection using AIC and BIC


 Compares every possible combination of predictors and determine which forms the best
model according to their AIC and BIC
 Testing by splitting the sample
o Disadv: the way the sample is split is totally random
 So we use cross-validation, which repeats the train-test split process multiple times
and takes the average
 5 or 10 folds is common
 Cross-validation
o Adv: gives you more confidence because of the reduced variability
Cross validation in non-parametric regression
 With a high number of predictors, distance-based models (e.g. k-nearest neighbours) are no
longer appropriate

Tutorial 5
Regularisation
 Introducing bias to hopefully reduce variance – therefore it is useful if there is a model with
high variance
 Useful function for splitting data into training/test sets
o from sklearn.model_selection import train_test_split
o train, test = train_test_split(data, test_size = 0.2,random_state=1)
 Standardising the predictors
o Essentially getting the predictors onto the same scale – note how vastly different
salary and children are
 Calculate the training mean and standard deviation, subtract the training mean from
both the training AND test sets, then divide by the training standard deviation
LASSO
 How much the coefficients shrink towards zero depends on α – the higher α is, the closer the
coefficients are to zero
o If a coefficient is at or close to 0, the corresponding predictor is essentially removed
 To choose α we can split into training and test sets or do cross-validation (CV)
o Note: the LassoCV function is the most convenient implementation, with built-in CV-based
model selection for the tuning parameter α
 Purpose of LASSO: shrinkage and variable selection – it trades a little bias for lower variance,
which can improve prediction
 Split into test and training sets to get a more honest measure of predictive performance
 If the coefficients of some predictors go to zero, this suggests those predictors can be
removed – see the sketch below
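A minimal sketch tying these notes together – standardise with training statistics only, then let LassoCV choose α; the data are simulated and only the first two predictors matter by construction:

# Sketch: standardise with training statistics, then use LassoCV to pick alpha.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mean) / std                 # training mean/std applied to BOTH sets
X_test_s = (X_test - mean) / std

lasso = LassoCV(cv=5).fit(X_train_s, y_train)
print("chosen alpha:", lasso.alpha_)
print("coefficients:", lasso.coef_)                # irrelevant predictors shrink to (near) zero
print("test R^2:", lasso.score(X_test_s, y_test))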

Ridge
 Difference from LASSO
o Never shrinks any value exactly to 0
o Adv: if we don’t want the feature selection in LASSO, generally has lower bias
ElasticNet
 Does both LASSO and Ridge
o LASSO – L1 penalty
o Ridge – L2 penalty
o Penalty term: a weighted combination of the L1 and L2 penalties

Splines
 By default, LSQUnivariateSpline fits a cubic spline, but you can specify parameter k to change
this (default k=3).
 Must sort data first – requirement of spline functions
Compared to linear regression, interpretation is a lot harder. A minimal fitting sketch follows.
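A minimal fitting sketch with scipy's LSQUnivariateSpline on simulated data; the knot locations are arbitrary choices:

# Sketch: fit a cubic least-squares spline with scipy; data must be sorted by x first.
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))               # sorting is required by the spline routine
y = np.sin(x) + rng.normal(scale=0.2, size=200)

knots = [2.5, 5.0, 7.5]                            # interior knots, here placed at rough quartiles
spline = LSQUnivariateSpline(x, y, knots)          # k=3 (cubic) by default
print(spline(np.array([1.0, 4.0, 9.0])))           # fitted values at new points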

GAMs
 Adv: easy computation; you can still visualise the relationship of each predictor with the response;
the fitted component values add up to give the prediction
 Disadv: does not allow for a complex joint relationship between x1, x2 and y – only complex
relationships between x1 and y, and between x2 and y, separately
o Only exception is to create an interaction term and set it as a new function and do
GAM on this
 L() – linear regression
 F() – for categorical variables
 S() – spline

Quiz
 Will not write code
 But interpret output or the code

Assignment

 Dummy variables –
o Have a fixed effect, or
o An interaction term
 Model use: prediction vs estimation
o Estimation – study of causation
 LASSO result
o Remove ATM, downtown (both highly correlated to shops), and high
Lecture 5: Classification and statistical decision theory
 Business decision making scenarios often first require a classification
 Goal of classification:
o ‘assign an object to exactly one category or class from a set of possible categories,
based on a set of observed measurements or features associated with that object’
 Output variable Y takes value in a discrete set with C categories
 Input variables are a vector X of predictors X1, . . . , Xp
 A classifier is a prediction rule denoted by G(x) that, based on characteristics X = x of an
object, assigns the object into one of the C categories/classes.
 So, c = G(x) if the object with characteristic x is classified into class c.

Decision theory for classification

Loss function
 Represented by a C x C loss matrix L
o Diagonal entries represent the correctly classified classes
o Non-diagonal entries represent misclassified classes
 For a 0-1 Loss function,
o 0 will be on diagonal for correct classes and 1 on off-diagonals for incorrect classes

Expected Prediction Loss


 We use out-of-sample observations to prevent overfitting

 We want to minimise this fraction of misclassified objects

Bayes classifier
The Bayes classifier is optimal under 0-1 loss, i.e., it has a smaller prediction error than any other
classifier (in other words, the smallest expected loss).

Proof of why Bayes classifier has the smallest expected loss


Note:
L(Y, G(x)) is either 1 or 0 – it equals 0 only when G(x) picks the true class c̃, and 1 for all other classes.

Decision boundary
We have two classes k and j, and x is our decision boundary (the point where the two class probabilities are equal).
 If an observation lies exactly on the boundary, we are indifferent as to which class it is assigned to.
Bayes classifier: binary case

So far we have assumed that these probabilities are the true probabilities – in practice they are unknown,
so we need to estimate them.

Empirical expected loss/error

Bayes classifier
We ideally want to select the class that maximises the conditional probability p_c(x). Just as in regression,
where the prediction is based on f(x), which we need to estimate, we do the same here for p_c(x).
K-nearest neighbours classifier

The neighbours are determined by a distance measure (e.g. Euclidean distance), and K specifies how many
of the nearest observations/neighbours are used.

 Adv:
o Easy to visualise for 2D
o Idea of this non-parametric method is to capture any complexities and non-linear
relationships
 Disadv:
o Hard to visualise beyond 2D
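A minimal KNN sketch with scikit-learn on simulated two-class data (the decision boundary and the values of K are invented for illustration):

# Sketch: K-nearest-neighbours classification for different K on simulated data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # invented true decision boundary

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k:3d}: test accuracy = {knn.score(X_test, y_test):.3f}")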
Choosing the optimal K
Suppose the true boundary is the dotted line and the black line is the estimated boundary

Below when k=10

Below when k = 1

 The red line shows discrete jumps – no smoothness in the boundary – suggesting the observations
are being overfitted

When k = 100
 The boundary line is smoother, but it becomes more biased – oversmoothing

Steps to find optimal K


1. Split data into training and test samples

Note: training error decreases as K decreases – K = 1 would give perfect classification in-sample,
but this would lead to overfitting
2. For each candidate K, compute the test error and choose the K with the smallest test error.

Logistic regression

Consider a binary classification problem with two categories: 1 and 0.


Then, we can use logistic regression for estimating p1(x) = P(Y = 1 | X = x) and p0(x)=P(Y =0|X=x).

We transform the output above to lie between 0 and 1 (i.e. normalising the output) – this gives the
probability that Y = 1 given x.

The function p(y | π) means we get π when y = 1 and 1 − π when y = 0. It is written below in the form
of a Bernoulli probability mass function, p(y | π) = π^y (1 − π)^(1−y).
Next, we find the log-likelihood

See Credit Card Default example on slide 26 for visualisation of classification – plots suggest balance
has a big impact on whether one defaults
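A hedged estimation sketch with statsmodels; the slide's Default data are not reproduced here, so the balance/default values below are simulated to mimic that example:

# Sketch: logistic regression for a binary default indicator on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, 1000)
p = 1 / (1 + np.exp(-(-8 + 0.005 * balance)))      # assumed true P(default = 1 | balance)
default = rng.binomial(1, p)
df = pd.DataFrame({"default": default, "balance": balance})

logit_fit = smf.logit("default ~ balance", data=df).fit()
print(logit_fit.summary())
print(logit_fit.predict(pd.DataFrame({"balance": [1000, 2000]})))   # estimated p1(x)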

Linear vs logistic regression


Linear fit
 The predicted probability goes below 0 (for small balances)
 The predicted probability never gets past about 0.5
The logistic curvature means that around a certain balance, the slope of the curve increases and default
becomes much more likely.
Derivation
Logistic regression – rescales something that looks like a linear regression so that the output lies between 0 and 1.

Property of logistic regression – models the log of odds ratio


 log( P(selecting class 1) / P(selecting class 0) )
 The odds are large when the numerator is high and the denominator is low – i.e. a high
probability of selecting y = 1
 The log then transforms the odds from (0, ∞) to (−∞, +∞)
 So the log of the odds ratio can be modelled with a linear regression

 It is also possible to assume the errors ε_d, ε_n follow another distribution (e.g. normal), but this
becomes difficult to handle when there are higher degrees of complexity.
 When interpreting the coefficients β, we want to interpret the difference of the coefficients, not the
coefficients themselves – it is this difference that tells us whether there is default or not.

Example: Credit Card Default


 Note: this shows that it is important to include key predictors – without them, the results
may be misleading

Multinomial logistic regression


We do a normalisation process
More on Binary Classification and Statistical Decision Theory

Confusion matrix
A binary classification problem with two categories
 Category 1: positive
 Category 0: negative

 The diagonal entries give the numbers of correct classifications; the off-diagonal entries give the
numbers of misclassifications.


 Used to see how well your classification is performing – computed on out-of-sample test data
(see the sketch below).
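A minimal sketch with scikit-learn's confusion_matrix; the labels and predictions below are hypothetical:

# Sketch: confusion matrix and derived rates on hypothetical test-set predictions.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])  # hypothetical out-of-sample labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])  # hypothetical classifier output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                       # true positive rate
specificity = tn / (tn + fp)                       # true negative rate
print(confusion_matrix(y_true, y_pred))
print("sensitivity:", sensitivity, "specificity:", specificity)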

Decision rule: other loss functions

 Instead of just assigning binary numbers, we can use generalised decision rule below (i.e.
suppose we are really concerned about the cost of fraud)

Generalised decision rule

Loss matrix
Loss-based optimal decision rule
Finding τ for a specific loss matrix

Here, the loss from an undetected fraud is greater than the cost of an investigation (i.e. L_FN > L_FP), and we choose τ to reflect this.

Check 1:35 for sanity checks

Other key concepts

Sensitivity and specificity


Sensitivity (true positive rate)

 We want this to be close to 1

Specificity (true negative rate)

 We also want this to be close to 1 – but we may care more about sensitivity than specificity
and vice versa depending on the problem

False positive and false negative rates


 Sensitivity is related to the power of our classifier

Trade-off between sensitivity and specificity

Imbalanced classes

Trade-off between sensitivity and specificity: ROC curve


Lecture 6: Neural networks and Deep learning
Introduction

Recommended readings:
 Introduction to Statistical Learning – chapter 10
 Elements of Statistical Learning – chapter 11
 Deep Learning by Goodfellow, Bengio and Courville

Introduction
 Neural networks and deep neural networks (called deep learning) – e.g. applied to image
recognition, language processing

Representation learning
We want to predict response Y based on raw/original covariates X with linear regression modelling
Often, before doing regression modelling, some appropriate transformation of the covariates Xi is
needed: Z1 = φ1(X), ...,Zd = φd(X).
Then we model
E(Y | X) = β₀ + β₁Z₁ + ... + β_d Z_d

Selection of the transformations φi(X) is an art!


Z = (Z1,...,Zd) is a representation of X = (X1,...,Xp).
A better representation (in terms of predicting Y ) leads to a better prediction accuracy.

 Neural network modelling can be thought of as a representation learning method. It provides an
efficient way to design a representation Z = φ(X) that is effective for predicting the response Y.
 Data representation learning is also important for unsupervised learning. We focus on neural
networks for supervised learning in this course.

Neural networks
Also called artificial neural network (ANN) is a computational model that is inspired by the network
of neurons in the human brain.
 X – inputs
 Z – neurons (or computational nodes)
 Y – output
Neuron
Neuron collects inputs X , which are then weighted (i.e. the importance of an input) to get the net
input. We put that through an activation function and then an output.

Note: the b in the Net Input function is the bias


 Can be treated like a prior

Neural networks: key definitions and mechanisms

A neural network is an interconnected assembly of (artificial) neurons, which communicate by sending
signals to each other over weighted connections.
 A neural network is made of layers: an input layer, (one or many) hidden layers, and an
output layer.
 The input layer nodes receive data from outside the network. The output layer node(s) sends
data out of the network. Hidden layer nodes receive/process/send data within the network.
 The nodes are connected – each connection has a weight, w.
 A neural network is said to be deep if it has many hidden layers. Deep neural network
modelling is collectively referred to as deep learning.

Example of a deep neural network


 There may be more than 1 output node
 We can think of each of these arrows (i.e. weights) as parameters which, like regression coefficients,
need to be estimated.

What are neural networks?


A neural network provides a mechanism for approximating a function:
 Suppose that Y = ftrue(X) + e, ftrue (X) is a true, yet unknown, function that we want to
estimate.
 ftrue (X) = E(Y|X) : the conditional mean of a response Y given X.
 A neural net with the output η(X) provides an approximation of ftrue (X).
 There are several variants of neural networks: feed-forward neural nets, convolutional neural
nets, recurrent neural nets. We focus mainly on feed-forward neural nets in this course.

Neural nets are multivariate functions -

How is Y = η(X) = Z^(L+1) computed?

What happens in each hidden layer node?

Elements of a neural network


Activation functions
Activation functions introduce non-linearities into the model, allowing neural networks to learn
complex patterns and capture relationships in the data that go beyond linear transformations.

 Adv:
o Flexible and computationally efficient
 For example, compared to higher degree polynomials: Evaluating polynomial
terms of high order can be computationally expensive and slow, particularly
when the degree of the polynomial increases. In contrast, many activation
functions are simple to compute making them faster and more efficient in
practice.
o Sparsity
 Activation function determines whether a neuron “fires” or remains inactive.
E.g. In ReLU, nodes with negative net inputs will output zero, effectively
turning them off and introducing sparsity into the network. This can be
advantageous for computational efficiency and reducing overfitting.

Popular activation functions

 The output of ReLU is max(0, s): it equals s when the net input s is positive and 0 otherwise.

Understanding neural nets

How does the above multiplication incorporate the bias? (i.e. what is the relationship between the bias
and S: S = bias + Σ (weight × Z))

Forward propagation

Yes – this is because the activation function was assumed to be linear in that example.

Here l indexes the hidden layers and h^(l) is the activation function applied to each S^(l).
Note: h^(L+1) is the output-layer activation function, which may or may not be linear. A small numerical sketch follows.
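A small numerical sketch of forward propagation through one hidden layer; the weights, biases, ReLU choice and identity output activation are arbitrary:

# Sketch: forward propagation through a tiny feed-forward net (one hidden layer, regression output).
import numpy as np

def relu(s):
    return np.maximum(0.0, s)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                             # input vector X (p = 3)

W1 = rng.normal(size=(4, 3)); b1 = np.zeros(4)     # layer 1: 4 hidden units
W2 = rng.normal(size=(1, 4)); b2 = np.zeros(1)     # output layer

z1 = relu(W1 @ x + b1)                             # S(1) = W1 x + b1,  Z(1) = h(S(1))
y_hat = (W2 @ z1 + b2)[0]                          # output activation is the identity here
print(y_hat)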
Neural net for regression
Example

Neural net for classification


 E.g. is the image a cat?

 Binary Y: Output of Y in this example is in the range (0,1)


 Multinomial Y: each output is also between 0 and 1
o Note: this is because each output is expressed in terms of exp(s_k), which is always greater than 0
o It is also a probability output, so the softmax outputs sum to 1

Training neural nets

Other network architecture decisions:


 How to select the number of hidden layers?
 How to select the number of units in each hidden layer?
 How to select activation functions? etc.
Steps to train a neural network
Specify a loss function, then minimise it with respect to the weights.

Challenges in training?

Issues:
 There is a huge number of parameters – since w is normally large
 The surface of the loss function is often non-smooth and multimodal
 Computationally expensive
In most cases, neural net models are trained by the Stochastic Gradient Descent (SGD) method.

Optimization with Gradient Descent

Suppose we start near an optimum: the slope will be close to 0, so the update term below will be small.
 The learning rate a_t is a key hyperparameter in neural network training

Example of optimising over 1D

 How do we determine the number of iterations, which also affects how fast we approach the
optimum? (A minimal 1D sketch follows.)
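A minimal 1D gradient-descent sketch; the loss J(w) = (w − 3)², the learning rate and the iteration count are arbitrary choices:

# Sketch: gradient descent on a 1D loss J(w) = (w - 3)^2 with a fixed learning rate.
def grad(w):
    return 2 * (w - 3)                             # dJ/dw

w, lr = 0.0, 0.1
for t in range(50):
    w = w - lr * grad(w)                           # w(t+1) = w(t) - a_t * gradient
print(w)                                           # converges towards the minimiser w = 3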

Stochastic gradient descent

NOTE: ‘Mathematically, if we have a cost function J(θ), where θ represents the parameters of our
model, GD updates these parameters in the opposite direction of the gradient of the function at the
current point:’

To reduce the cost of each update, SGD uses mini-batches of size m instead of the whole set of n observations.
There is a trade-off – cheaper, noisier steps vs a more accurate gradient – and this trade-off determines
how large m should be.
Computing the gradient

Back propagation

We use the chain rule to compute the effect of each weight on our output, Y.

 To find δ^(l) we need to know δ^(l+1) – hence back-propagation (we work backwards from the output layer)


Rewatch the chain rule at around 1:35 for back propagation

Selecting the learning rate


Determines the efficiency of convergence – a_t should be designed as follows:
 Component-dependent: each component of the gradient vector needs a different step size,
i.e., a_t should be a vector, not a scalar
 Adaptive: select a_t based on the behaviour of w(0), w(1), . . . , w(t−1)

If the gradient varies a lot with steps, this suggests that we need to reduce the steps to improve
stability.

AdaGrad (2011) vs ADAM (2014)?

AdaGrad – adaptive gradient – accumulates squared gradient as a proxy for variability


 If G is large, it suggests high variability so we want to reduce the learning rate

ADAM – adaptive moment estimation – computes estimates of the average gradient and its variance
(which is used to scale the gradient) – often used now
 Estimates the average gradient over time
Regularization in deep learning

Regularisation is standard in deep learning because the models have a very large number of parameters.
