Session 02 - Regression and Classification
In this session
• We will learn the fundamental aspects of regression and classification models.
We seek a model that predicts sales based on the TV, radio, and newspaper predictor variables:
$$\text{sales} = f(\text{TV}, \text{radio}, \text{newspaper})$$
[*] http://www-bcf.usc.edu/~gareth/ISL/data.html
Regression
• The goal of regression is to predict the value of one or more continuous target variables $y$ given the value of a p-dimensional vector $x$ of input variables.
[Figure: scatter plot of house price (£, in 1000's) against size (feet²); a query point with Size = 1080 is marked with its unknown "Price?"]
• The simple linear regression model is
$$Y = \beta_0 + \beta_1 X + \epsilon$$
• where $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and $\epsilon$ is the error term.
• The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimise the residual sum of squares (RSS). That is, we solve for $\hat{\beta}_0$ and $\hat{\beta}_1$ from $\partial RSS(\hat{\beta}_0, \hat{\beta}_1)/\partial \hat{\beta}_0 = 0$ and $\partial RSS(\hat{\beta}_0, \hat{\beta}_1)/\partial \hat{\beta}_1 = 0$, respectively.
• The minimising values can be shown to be:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
• where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
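Below is a minimal sketch (not from the slides) of computing these closed-form estimates with numpy; the data, intercept, and slope values are synthetic and purely illustrative.

```python
import numpy as np

# Synthetic example: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

# Closed-form least squares estimates from the formulas above.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(f"intercept: {beta0_hat:.3f}, slope: {beta1_hat:.3f}")
```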
$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \epsilon$$
• Or, if 1 is included in $X$,
$$Y = \sum_{j=0}^{p} X_j \beta_j + \epsilon = X^T \beta + \epsilon$$
• Predicting a new output value, assuming $\hat{\beta}$ has already been estimated, is given by
$$\hat{Y} = X^T \hat{\beta}$$
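A sketch of estimating $\beta$ by ordinary least squares and predicting $\hat{Y} = X^T \hat{\beta}$ with numpy; the data and coefficient values are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.7])
y = 4.0 + X @ beta_true + rng.normal(scale=0.5, size=n)

# Include the constant 1 in X so the intercept is part of beta.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Prediction for a new observation: y_hat = x^T beta_hat
x_new = np.array([1.0, 0.2, -1.0, 0.5])  # leading 1 for the intercept
y_hat = x_new @ beta_hat
print(beta_hat, y_hat)
```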
$$RSE = \sqrt{\frac{1}{n-2} RSS} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
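A quick illustrative sketch of computing the RSE for a fitted simple regression; the synthetic data and its noise scale of 1.0 are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

# Fit by least squares, then measure the residual standard error.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
residuals = y - (beta0_hat + beta1_hat * x)
rse = np.sqrt(np.sum(residuals ** 2) / (len(y) - 2))
print(f"RSE: {rse:.3f}")  # should be close to the simulated noise scale (1.0)
```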
$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \epsilon$$
• We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, holding all other predictors fixed.
Issues
• The ideal scenario is when the predictors are uncorrelated – a balanced design:
• Each coefficient can be estimated and tested separately.
• Interpretations such as “a unit change in $X_j$ is associated with a $\beta_j$ change in $Y$, while all the other variables stay fixed” are possible.
• Correlations amongst predictors cause problems (see the simulation sketch after this list):
• The variance of all coefficients tends to increase, sometimes dramatically.
• Interpretations become hazardous – when $X_j$ changes, everything else changes.
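A small simulation sketch of this effect: the standard deviation of the OLS estimate of one coefficient, with and without strong correlation between the two predictors. All numbers here are synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def coef_sd(corr, n=100, reps=500):
    """Std dev of the OLS estimate of beta_1 over repeated samples."""
    cov = np.array([[1.0, corr], [corr, 1.0]])
    estimates = []
    for _ in range(reps):
        X = rng.multivariate_normal([0, 0], cov, size=n)
        y = 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)
        X1 = np.column_stack([np.ones(n), X])
        beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
        estimates.append(beta_hat[1])
    return np.std(estimates)

# The correlated case gives a noticeably larger standard deviation.
print("sd(beta1_hat), uncorrelated:", round(coef_sd(0.0), 3))
print("sd(beta1_hat), corr = 0.95: ", round(coef_sd(0.95), 3))
```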
Categorical predictors
• A categorical (or qualitative, or factor) predictor takes only categorical values, i.e. levels with no particular order.
• Examples: gender (female, male), marital status (single, married, etc), ethnicity
(Caucasian, African American, Asian).
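A sketch of how such a predictor can be encoded as dummy (indicator) variables, here with pandas; the data frame and its values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "balance": [500.0, 1200.0, 300.0],
    "ethnicity": ["Caucasian", "African American", "Asian"],
})

# One dummy column per level, dropping one level as the baseline
# so its effect is absorbed by the intercept.
encoded = pd.get_dummies(df, columns=["ethnicity"], drop_first=True)
print(encoded)
```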
Polynomial regression
• General form:
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i$$
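A sketch of fitting a degree-3 polynomial regression with scikit-learn; the degree, data, and coefficients are illustrative assumptions, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, size=150)).reshape(-1, 1)
y = 1.0 - 2.0 * x.ravel() + 0.5 * x.ravel() ** 3 + rng.normal(scale=1.0, size=150)

# Expand x into [x, x^2, x^3] and fit an ordinary linear model on the expansion.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[0.0], [2.0]])))
```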
Piecewise polynomials
• Instead of a single polynomial in $X$ over its whole domain, we can use different polynomials over regions whose boundaries are defined by knots.
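A minimal sketch of the idea with a single knot at $x = 0$ and a separate cubic fitted in each region; the data and knot location are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(-3, 3, size=200))
y = np.where(x < 0, 1 + x, 1 - 2 * x + x ** 2) + rng.normal(scale=0.3, size=200)

knot = 0.0
left, right = x < knot, x >= knot

# Fit a cubic polynomial separately in each region defined by the knot.
coef_left = np.polyfit(x[left], y[left], deg=3)
coef_right = np.polyfit(x[right], y[right], deg=3)

x_new = np.array([-1.0, 1.0])
pred = np.where(x_new < knot, np.polyval(coef_left, x_new), np.polyval(coef_right, x_new))
print(pred)
```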
• It is called an additive model because we calculate a separate 𝑓𝑗 for each 𝑋𝑗 and then add
together all of their contributions.
• e.g. let’s assume the following model:
$$y_i = \beta_0 + f_1(\text{year}_i) + f_2(\text{age}_i) + f_3(\text{education}_i) + \epsilon_i$$
• where year and age are numeric, and education is categorical with levels (“<HS”, ”HS”, “<Coll”, “Coll”).
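A sketch of fitting such an additive model, assuming the pygam package is available: spline terms ($s$) for the numeric predictors and a factor term ($f$) for the categorical one. All data and value ranges here are synthetic and invented for illustration.

```python
import numpy as np
from pygam import LinearGAM, s, f  # assumes the pygam package is installed

rng = np.random.default_rng(5)
n = 500
year = rng.integers(2003, 2010, size=n)
age = rng.uniform(20, 65, size=n)
education = rng.integers(0, 4, size=n)  # four coded education levels

# Synthetic response with a separate contribution from each predictor.
y = 0.3 * (year - 2003) + 0.005 * (age - 40) ** 2 + 2.0 * education + rng.normal(size=n)

X = np.column_stack([year, age, education])

# s(): smooth (spline) term for a numeric column, f(): factor term for a categorical one.
gam = LinearGAM(s(0) + s(1) + f(2)).fit(X, y)
gam.summary()
```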
Classification
• Qualitative variables take values in an unordered set 𝒞, such as:
• Given a feature vector $X$ and a qualitative response $Y$ taking values in the set $\mathcal{C}$, the classification task is to build a function $C(X)$ that takes as input the feature vector $X$ and predicts its value for $Y$; i.e. $C(X) \in \mathcal{C}$.
• Often we are more interested in estimating the probabilities that $X$ belongs to each category in $\mathcal{C}$.
• For example, it is more valuable to have an estimate of the probability that an insurance claim is fraudulent than a classification of fraudulent or not.
Logistic regression
• Let's write $p(X) = \Pr(Y = 1 \mid X)$ for short and consider using balance to predict default.
• Logistic regression uses the form
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
($e \approx 2.71828$ is Euler's number.)
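A sketch of fitting this model with scikit-learn's LogisticRegression on synthetic balance/default-style data; the coefficients used to simulate the data are invented, and this is not the ISL Default dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
balance = rng.uniform(0, 2500, size=1000)
# Probability of default increases with balance (synthetic ground truth).
p_true = 1 / (1 + np.exp(-(-10.0 + 0.0055 * balance)))
default = rng.binomial(1, p_true)

clf = LogisticRegression()
clf.fit(balance.reshape(-1, 1), default)

# Estimated Pr(default = 1 | balance) for a couple of balances.
print(clf.predict_proba(np.array([[1000.0], [2000.0]]))[:, 1])
```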
Making predictions
Multinomial regression
• So far we have discussed logistic regression with two classes. It is easily generalised to more than two classes. One version (used in the R package glmnet) has the symmetric form
$$\Pr(Y = k \mid X = x) = \frac{e^{\beta_{0k} + \beta_{1k} x_1 + \cdots + \beta_{pk} x_p}}{\sum_{l=1}^{K} e^{\beta_{0l} + \beta_{1l} x_1 + \cdots + \beta_{pl} x_p}}$$
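A small numpy sketch of evaluating this symmetric (softmax) form for one observation, using made-up coefficients for $K = 3$ classes and $p = 2$ predictors.

```python
import numpy as np

# Hypothetical coefficients: one intercept and one coefficient vector per class.
beta0 = np.array([0.5, -0.2, 0.0])             # beta_{0k}
beta = np.array([[1.0, -0.5],
                 [0.3,  0.8],
                 [-1.2, 0.1]])                  # beta_{jk}, one row per class

x = np.array([0.7, -1.1])                       # a single observation

scores = beta0 + beta @ x                       # linear score for each class
probs = np.exp(scores) / np.exp(scores).sum()   # symmetric softmax form
print(probs, probs.sum())                       # probabilities sum to 1
```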
Bayes' theorem
• Bayes' theorem is stated as the following equation:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where $A$ and $B$ are events, and $P(B) \neq 0$.
• $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ without regard to each other.
• $P(A \mid B)$, a conditional probability, is the probability of observing event $A$ given that $B$ is true.
• $P(B \mid A)$ is the probability of observing event $B$ given that $A$ is true.
• Bayes' theorem is the key to using new observations to modify prior beliefs.
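A tiny worked sketch of the theorem with invented numbers (a rare condition $A$ and an imperfect test $B$), just to show how a prior belief is updated by a new observation.

```python
# Hypothetical numbers: P(A) is the prior probability of a condition,
# P(B | A) the test's sensitivity, P(B | not A) its false-positive rate.
p_a = 0.01
p_b_given_a = 0.95
p_b_given_not_a = 0.05

# Total probability of a positive test, P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A | B) by Bayes' theorem.
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # ~0.161: a positive test still leaves A fairly unlikely
```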
$$P(H \mid e) = \frac{P(e \mid H)\,P(H)}{P(e)}$$
How the prior, combined with the evidence, is reflected in the posterior.
Discriminant Analysis
• Here the approach is to model the distribution of $X$ in each of the classes separately, and then use Bayes' theorem to flip things around and obtain $\Pr(Y \mid X)$.
• When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic
discriminant analysis.
• However, this approach is quite general, and other distributions can be used as well. We will
focus on normal distributions.
$$\Pr(Y = k \mid X = x) = \frac{\Pr(X = x \mid Y = k)\,\Pr(Y = k)}{\Pr(X = x)}$$
$$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$
• $f_k(x) = \Pr(X = x \mid Y = k)$ is the density for $X$ in class $k$. Here we will use normal densities for these, separately in each class.
• $\pi_k = \Pr(Y = k)$ is the marginal or prior probability for class $k$.
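A sketch of evaluating this posterior with a single predictor and Gaussian class densities; the priors, means, and standard deviations are made-up values.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters for K = 2 classes on a single predictor.
priors = np.array([0.7, 0.3])   # pi_k
means = np.array([0.0, 2.0])    # class means
sds = np.array([1.0, 1.0])      # shared sd -> an LDA-like setting

x = 1.2                         # observation to classify

# Class-conditional densities f_k(x), then the posterior via Bayes' theorem.
densities = norm.pdf(x, loc=means, scale=sds)
posterior = priors * densities / np.sum(priors * densities)
print(posterior)                # Pr(Y = k | X = x) for k = 1, 2
```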
• When the $f_k(x)$ are Gaussian densities, with the same covariance matrix in each class, this leads to linear discriminant analysis.
• By altering the forms for $f_k(x)$, we get different classifiers:
  • With Gaussians but different $\Sigma_k$ in each class, we get quadratic discriminant analysis.
  • With $f_k(x) = \prod_{j=1}^{p} f_{jk}(x_j)$ (a conditional independence model) in each class, we get naive Bayes. For Gaussians this means the $\Sigma_k$ are diagonal.
  • Many other forms are possible, by proposing specific density models for $f_k(x)$, including nonparametric approaches.
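A sketch comparing these three classifiers with scikit-learn on synthetic two-class Gaussian data; the class means and covariances are invented for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(7)
# Two Gaussian classes with different means (and a different spread for class 1).
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=300)
X1 = rng.multivariate_normal([2, 2], [[1.5, -0.3], [-0.3, 0.8]], size=300)
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 300)

# Training accuracy of LDA, QDA, and Gaussian naive Bayes on the same data.
for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis(), GaussianNB()):
    clf.fit(X, y)
    print(type(clf).__name__, round(clf.score(X, y), 3))
```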
• Algorithm:
  1. Given a query point $x_0$, find the $k$ training points $x_{(r)}$, $r = 1, \dots, k$, closest in distance to $x_0$.
  2. Classify using a majority vote among the $k$ neighbours.
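A minimal numpy sketch of this algorithm, using Euclidean distance and a majority vote; the training data are synthetic and the function name is illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=5):
    """Classify x0 by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)             # distances to the query point
    nearest = np.argsort(dists)[:k]                           # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                          # majority class

rng = np.random.default_rng(8)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_predict(X_train, y_train, np.array([2.5, 2.5]), k=7))
```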
Decision boundaries
• The K-NN algorithm does not explicitly compute decision boundaries.
• The more examples that are stored, the more complex the decision boundaries can become.
[Figure: decision boundary of a 2-class, 2-dimensional problem using 7-NN]
• K-NN heavily suffers from the curse of dimensionality:
• Suppose we have 5000 points uniformly distributed in the unit hypercube and we want to apply 5-NN.
• Suppose our query point is at the origin:
  • 1D – on a one-dimensional line, we must go a distance of 5/5000 = 0.001 on average to capture the 5 nearest neighbours.
  • 2D – in two dimensions, we must go $0.001^{1/2} \approx 0.032$ to get a square that contains 0.001 of the volume.
  • pD – in p dimensions, we must go $0.001^{1/p}$!
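A quick sketch of how fast that required edge length grows with the dimension $p$:

```python
# Edge length of the hypercube needed to capture a 0.001 fraction of the volume.
for p in (1, 2, 3, 10, 100):
    print(p, round(0.001 ** (1 / p), 3))
# 1 -> 0.001, 2 -> 0.032, 3 -> 0.1, 10 -> 0.501, 100 -> 0.933
```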
K-Nearest Neighbours
• Advantages:
  • Simple technique that is easily implemented.
  • Building the model is inexpensive.
  • Extremely flexible classification scheme: does not involve preprocessing.
  • Well suited for:
    • Multi-modal classes (classes of multiple forms).
    • Records with multiple class labels.
  • Asymptotic error rate at most twice the Bayes rate (Cover & Hart, 1967).
  • Can sometimes be the best method.
• Disadvantages:
  • Classifying unknown records is relatively expensive:
    • Requires distance computations to the k nearest neighbours.
    • Computationally intensive, especially as the training set grows.
  • Accuracy can be severely degraded by the presence of noisy or irrelevant features.
  • NN classification expects the class-conditional probability to be locally constant, leading to bias in high dimensions.
Summary
1. We learnt about regression models and several regression methods: simple and multiple linear regression, and non-linear regression (polynomial, splines, GAMs).
2. We learnt about classification models and several classification methods: logistic and multinomial regression, discriminant analysis, naive Bayes, and k-nearest neighbours.
3. We also learnt how to interpret the coefficients of linear and logistic regression models.