Session 02 – Regression and Classification

Overview: This session covers fundamental aspects of regression and classification models, illustrated by predicting sales from advertising spending data. Regression aims to predict a continuous target variable given input variables, while classification predicts a discrete class. Simple linear regression fits a linear relationship between a single input and the output, while multiple regression uses several inputs. Regression parameters are estimated by minimising the residual sum of squares between predicted and actual target values on the training data, so that the fitted model can predict new examples from their input features.



Session 2 – Regression and Classification
Dr Ivan Olier
[email protected]
ECI – International Summer School / Machine Learning
2019

In this session
• We will learn fundamental aspects about regression and classification models.


Example: Advertising data

• Consider the following dataset[*], which was collected to predict the impact of advertising on sales:

  TV      radio   newspaper   sales
  230.1   37.8    69.2        22.1
  44.5    39.3    45.1        10.4
  17.2    45.9    69.3        9.3
  151.5   41.3    58.5        18.5
  180.8   10.8    58.4        12.9
  …       …       …           …

• Is there a relationship between advertising budget and sales?
• How strong is the relationship between advertising budget and sales?
• Which media contribute to sales?
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?

We should expect a model that predicts sales based on the TV, radio, and newspaper predictor variables:

  $sales = f(TV, radio, newspaper)$

[*] http://www-bcf.usc.edu/~gareth/ISL/data.html

Regression

• The goal of regression is to predict the value of one or more continuous target variables $y$ given the value of a $p$-dimensional vector $x$ of input variables.

[Figure: house price (£, in 1000's) versus size (in feet²), with a fitted curve used to read off the predicted price of a house of size 1080.]

[Figure: a multivariate regression example (2-dimensional inputs).]


Simple (Univariate) linear regression

• We assume a model

  $Y = \beta_0 + \beta_1 X + \epsilon$

• where $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and $\epsilon$ is the error term.
• Given some estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the model coefficients, we predict a new response $\hat{y}$ for a new input value $x$ using

  $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

• The hat symbol denotes an estimated value.

[Figure: the fitted line on the house-price data gives Price = 225 (£, in 1000's) at Size = 1080 feet².]


Simple linear regression – Parameter estimation


• Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i = y_i - \hat{y}_i$ represents the $i$th residual.
• We define the residual sum of squares (RSS) as:

  $RSS = e_1^2 + e_2^2 + \cdots + e_n^2 = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2$

• The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimise the RSS. That is, we solve for $\hat{\beta}_0$ and $\hat{\beta}_1$ from $\partial RSS(\hat{\beta}_0, \hat{\beta}_1)/\partial \hat{\beta}_0 = 0$ and $\partial RSS(\hat{\beta}_0, \hat{\beta}_1)/\partial \hat{\beta}_1 = 0$, respectively.
• The minimising values can be shown to be:

  $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

• where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
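These closed-form estimates translate directly into code. A minimal NumPy sketch (the data values below are illustrative, not from the slides):

```python
import numpy as np

def simple_ols(x, y):
    """Least-squares estimates for y = b0 + b1 * x + error."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Toy data: house size (feet^2) vs price (£, in 1000's)
size = np.array([500.0, 800.0, 1080.0, 1500.0, 2000.0])
price = np.array([95.0, 140.0, 170.0, 230.0, 300.0])
b0, b1 = simple_ols(size, price)
print(f"intercept = {b0:.2f}, slope = {b1:.4f}, prediction at size 1080: {b0 + b1 * 1080:.1f}")
```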


Multiple (Multivariate) linear regression model


• Now, let's assume $X^T$ is a vector of inputs $X_1, \ldots, X_P$. A multiple linear regression model is defined as

  $Y = \beta_0 + \sum_{j=1}^{P} X_j \beta_j + \epsilon$

• Or, if the constant 1 is included in $X$,

  $Y = \sum_{j=0}^{P} X_j \beta_j + \epsilon = X^T \beta + \epsilon$

• Predicting a new output variable value, assuming $\hat{\beta}$ has already been estimated, is given by

  $\hat{Y} = X^T \hat{\beta}$

Multiple linear regression – Estimation of parameters


• How do we fit the multiple linear model to a set of training data? Again, least squares is an option.
• It estimates $\hat{\beta}$ by minimising the residual sum of squares

  $RSS(\hat{\beta}) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - x_i^T \hat{\beta})^2 = (\mathbf{y} - \mathbf{X}\hat{\beta})^T (\mathbf{y} - \mathbf{X}\hat{\beta})$

• where $(\mathbf{X}, \mathbf{y})$ is the training dataset: $\mathbf{X}$ an $n \times p$ matrix and $\mathbf{y}$ an $n$-vector of outputs,
• and $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)^T$ is a vector of model parameters.
• Since RSS is a quadratic function of the parameters, a minimum is guaranteed to exist, although it may not be unique. Differentiating with respect to $\beta$, equating to 0 and solving for $\beta$, we get

  $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
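A sketch of this normal-equation solution in NumPy on simulated data (illustrative only; in practice a dedicated least-squares solver such as np.linalg.lstsq is numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # include the constant 1
true_beta = np.array([2.0, 0.5, -1.0, 3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# beta_hat = (X^T X)^{-1} X^T y, solved without forming the explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # should be close to true_beta
```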

Example – Advertising data


• Suggested multiple linear regression model:

  $sales = \beta_0 + \beta_1 \cdot TV + \beta_2 \cdot radio + \beta_3 \cdot newspaper + \epsilon$
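As a toy illustration, this model can be fitted by least squares to the five example rows shown earlier (the full Advertising dataset from the ISL website is much larger):

```python
import numpy as np

# TV, radio, newspaper, sales (the five rows shown in the example table)
data = np.array([
    [230.1, 37.8, 69.2, 22.1],
    [ 44.5, 39.3, 45.1, 10.4],
    [ 17.2, 45.9, 69.3,  9.3],
    [151.5, 41.3, 58.5, 18.5],
    [180.8, 10.8, 58.4, 12.9],
])
X = np.column_stack([np.ones(len(data)), data[:, :3]])   # intercept + three media budgets
y = data[:, 3]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept", "TV", "radio", "newspaper"], beta_hat.round(4))))
```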


Performance metrics for regression


• Residual standard error (RSE) – roughly, the average amount by which the response deviates from the true regression line:

  $RSE = \sqrt{\frac{1}{n-2} RSS} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

• R-squared – the fraction of variance explained:

  $R^2 = 1 - \frac{RSS}{TSS}$

  where $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares.
• If the regression model is simple linear, then $R^2 = r^2$, where $r$ is the correlation between $X$ and $Y$.
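A short sketch computing both metrics for a fitted simple linear regression (reusing the toy house-price fit from earlier; the $n-2$ follows the RSE definition above):

```python
import numpy as np

def regression_metrics(y, y_hat):
    """RSE and R^2 for a simple linear regression fit."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    rse = np.sqrt(rss / (len(y) - 2))   # n - 2 degrees of freedom (two fitted coefficients)
    r2 = 1 - rss / tss
    return rse, r2

size = np.array([500.0, 800.0, 1080.0, 1500.0, 2000.0])
price = np.array([95.0, 140.0, 170.0, 230.0, 300.0])
b1 = np.sum((size - size.mean()) * (price - price.mean())) / np.sum((size - size.mean()) ** 2)
b0 = price.mean() - b1 * size.mean()
rse, r2 = regression_metrics(price, b0 + b1 * size)
print(f"RSE = {rse:.2f}, R^2 = {r2:.3f}")
```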


Some important questions


1. Is at least one of the predictors 𝑋1 , 𝑋2 , … , 𝑋𝑝 useful in predicting the response?
2. Do all the predictors help to explain 𝑌 , or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate
is our prediction?


Interpreting regression coefficients


• A multiple regression model:

  $Y = \beta_0 + \sum_{j=1}^{P} X_j \beta_j + \epsilon$

• We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, holding all other predictors fixed.


Issues
• The ideal scenario is when the predictors are uncorrelated – a balanced design:
• Each coefficient can be estimated and tested separately.
• Interpretations such as “a unit change in 𝑋𝑗 is associated with a 𝛽𝑗 change in 𝑌 , while all
the other variables stay fixed”, are possible.
• Correlations amongst predictors cause problems:
• The variance of all coefficients tends to increase, sometimes dramatically.
• Interpretations become hazardous – when 𝑋𝑗 changes, everything else changes.


Categorical predictors
• A categorical (or qualitative, or factor) predictor takes categorical values (i.e. levels with no
particular order) only.
• Examples: gender (female, male), marital status (single, married, etc), ethnicity
(Caucasian, African American, Asian).

• How do we code categorical predictors?

  • WRONG WAY – Create a numerical variable and assign an integer value to each level. This wrongly imposes an order on the levels.
  • CORRECT WAY – For a predictor with $m$ levels, create $(m - 1)$ binary variables, one for each level except one. Each binary variable takes the value 1 to indicate its corresponding level. The remaining level, represented by all binary variables taking 0, is called the baseline.


Example – Coding BMI


• Let's assume we have a categorical variable for BMI with levels: Underweight, Normal, Overweight, Obese.
• It must be coded using three new binary variables as follows:

  New binary variable   Value   Meaning
  BMI[Normal]           1       BMI is normal
                        0       BMI is not normal
  BMI[Overweight]       1       BMI is overweight
                        0       BMI is not overweight
  BMI[Obese]            1       BMI is obese
                        0       BMI is not obese

• The baseline, which is not a variable, represents the absence of the coded levels:
  BMI[Normal] = 0, BMI[Overweight] = 0 and BMI[Obese] = 0 means that BMI is neither normal, nor overweight, nor obese.
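A sketch of this coding with pandas; listing the categories explicitly makes Underweight the first level, so drop_first=True leaves it as the baseline (data values are illustrative):

```python
import pandas as pd

bmi = pd.Categorical(
    ["Underweight", "Normal", "Obese", "Overweight", "Normal"],
    categories=["Underweight", "Normal", "Overweight", "Obese"],
)
# m = 4 levels -> 3 binary variables; the dropped first level is the baseline
dummies = pd.get_dummies(pd.Series(bmi), prefix="BMI", drop_first=True)
print(dummies)
```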

Non-linear regression models


• The truth is never linear! Or almost never!
• But often the linearity assumption is good enough.

When it's not …


• polynomials,
• step functions,
• splines,
• local regression, and
• generalised additive models
offer a lot of flexibility, without losing the ease and interpretability of linear models.


Polynomial regression
• General form:

  $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i$

• Create new variables $X_1 = X$, $X_2 = X^2$, etc., and then treat the problem as multiple linear regression.
• Coefficient values are less relevant; the interest is in the fitted predictions.
• Confidence intervals are estimated point-wise.
• We either fix the degree $d$ at some reasonably low value, or else use cross-validation to choose $d$.

[Figure: a degree-4 polynomial fit.]
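A minimal sketch of the "create new variables, then fit" idea in NumPy (degree and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 80)
y = 0.5 * x**3 - x + rng.normal(scale=1.0, size=x.size)   # illustrative non-linear truth

d = 4                                              # fixed, reasonably low degree
X_poly = np.vander(x, d + 1, increasing=True)      # columns: 1, x, x^2, ..., x^d
beta_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
y_hat = X_poly @ beta_hat                          # point-wise fitted values
print(beta_hat.round(3))
```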

Piecewise polynomials
• Instead of a single polynomial in $X$ over its whole domain, we can instead use different polynomials in regions defined by knots:

  $y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \cdots + \epsilon_i, & x_i < c \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \cdots + \epsilon_i, & x_i \geq c \end{cases}$

• It is better to add constraints to the polynomials, e.g. continuity at the knots.
• Splines have the “maximum” amount of continuity.
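An unconstrained piecewise fit can be sketched by fitting a separate polynomial on each side of a knot (illustrative only; proper spline fits add continuity constraints, e.g. via basis functions):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 120))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

c = 5.0                                               # knot location
left, right = x < c, x >= c
coef_left = np.polyfit(x[left], y[left], deg=2)       # quadratic on each region
coef_right = np.polyfit(x[right], y[right], deg=2)
y_hat = np.where(left, np.polyval(coef_left, x), np.polyval(coef_right, x))
# Note: the two fits need not agree at x = c; splines impose continuity there.
```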


Generalised additive models (GAM)


• Allows for flexible nonlinearities in several variables, but retains the additive structure of linear models:

  $y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i$

• It is called an additive model because we calculate a separate $f_j$ for each $X_j$ and then add together all of their contributions.
• E.g. let's assume the following model:

  $\text{wage} = \beta_0 + f_1(\text{year}) + f_2(\text{age}) + f_3(\text{education}) + \epsilon$

• where year and age are numeric, and education is categorical with levels (“<HS”, ”HS”, “<Coll”, “Coll”).
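A sketch of such a model with the third-party pyGAM library (assumed to be installed; in pyGAM, s() denotes a spline term and f() a factor/step term, indexed by column). The data below are simulated purely for illustration:

```python
import numpy as np
from pygam import LinearGAM, s, f   # pip install pygam

rng = np.random.default_rng(3)
n = 300
year = rng.integers(2003, 2010, n).astype(float)
age = rng.uniform(18, 80, n)
education = rng.integers(0, 4, n)                        # 4 coded education levels
wage = (30 + 0.5 * (year - 2003) + 0.8 * age - 0.005 * age**2
        + 3 * education + rng.normal(scale=5, size=n))   # illustrative data only

X = np.column_stack([year, age, education])
gam = LinearGAM(s(0) + s(1) + f(2)).fit(X, wage)         # splines for year, age; step for education
gam.summary()
```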


Generalised additive models (GAM)


• In the fitted example, year and age are modelled with natural splines, while education is modelled with a step function.


Some considerations with GAMs


• GAMs allow us to fit a non-linear 𝑓𝑗 to each 𝑋𝑗 , so that we can automatically model non-
linear relationships that standard linear regression will miss.
• The non-linear fits can potentially make more accurate predictions for the response 𝑌.
• Because the model is additive, we can still examine the effect of each 𝑋𝑗 on 𝑌 individually.
• The main limitation of GAMs is that the model is restricted to be additive. With many
variables, important interactions can be missed.
• However, as with linear regression, we can manually add interaction terms to the GAM
model by including additional predictors of the form 𝑋𝑗 × 𝑋𝑘 .
• In addition, we can add low-dimensional interaction functions of the form 𝑓𝑗𝑘 𝑋𝑗 , 𝑋𝑘 into
the model.


Classification
• Qualitative variables take values in an unordered set $\mathcal{C}$.
• Given a feature vector $X$ and a qualitative response $Y$ taking values in the set $\mathcal{C}$, the classification task is to build a function $C(X)$ that takes as input the feature vector $X$ and predicts its value for $Y$; i.e. $C(X) \in \mathcal{C}$.
• Often we are more interested in estimating the probabilities that $X$ belongs to each category in $\mathcal{C}$.
• For example, it is more valuable to have an estimate of the probability that an insurance claim is fraudulent than a simple classification of fraudulent or not.


Example: Credit Card Default


Can we predict whether a customer will default?


Logistic regression
• Let's write $p(X) = \Pr(Y = 1 \mid X)$ for short and consider using balance to predict default.
• Logistic regression uses the form

  $p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$

  ($e \approx 2.71828$ is Euler's number.)
• It is easy to see that no matter what values $\beta_0$, $\beta_1$ or $X$ take, $p(X)$ will have values between 0 and 1.
• A bit of rearrangement gives

  $\log\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$

• This monotone transformation is called the log odds or logit transformation of $p(X)$. (By log we mean the natural log: ln.)

[Figure: the logit function.]
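A sketch of the logistic form and a resulting prediction; the coefficient values are made up for illustration, not fitted values from the Default data:

```python
import numpy as np

def logistic_prob(x, b0, b1):
    """p(X) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)); always lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

b0, b1 = -10.0, 0.005              # illustrative coefficients for 'balance'
for balance in (1000, 2000, 3000):
    p = logistic_prob(balance, b0, b1)
    print(f"balance={balance}: P(default) = {p:.3f}, log-odds = {np.log(p / (1 - p)):.2f}")
```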

Probabilities and odds


• The odds of an event are commonly used in betting circles.
• For example, a bookmaker may offer odds of 10 to 1 that Arsenal Football Club will be
champions of the Premiership this season.
• This means that the bookmaker considers the probability that Arsenal will not be
champions is 10 times the probability that they will be.
• Odds and probabilities:
• The odds of event A are defined as the probability that A does happen divided by the
probability that it does not happen:
  $\text{Odds}(A) = \dfrac{\Pr(A)}{1 - \Pr(A)}$

• Odds ratios (ORs):

  $OR = \dfrac{\text{Odds}(A)}{\text{Odds}(B)}$

Making predictions


Multiple logistic regression
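As a sketch, a logistic regression with several predictors (illustrative names such as balance, income and student status) can be fitted with scikit-learn; the data below are simulated for illustration, and note that scikit-learn applies L2 regularisation by default:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 500
X = np.column_stack([rng.normal(2000, 500, n),      # e.g. balance
                     rng.normal(40000, 10000, n),   # e.g. income
                     rng.integers(0, 2, n)])        # e.g. student (0/1)
logits = -10 + 0.005 * X[:, 0] - 1e-5 * X[:, 1] - 0.5 * X[:, 2]
y = rng.random(n) < 1 / (1 + np.exp(-logits))       # simulate default / no default

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.intercept_, clf.coef_)                    # estimated beta_0 and beta_1..beta_3
print(clf.predict_proba(X[:3])[:, 1])               # P(default) for the first 3 customers
```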


Multinomial regression
• So far we have discussed logistic regression with two classes. It is easily generalised to more than two classes. One version (used in the R package glmnet) has the symmetric form

  $\Pr(Y = k \mid X = x) = \dfrac{e^{\beta_{0k} + \beta_{1k} x_1 + \cdots + \beta_{pk} x_p}}{\sum_{l=1}^{K} e^{\beta_{0l} + \beta_{1l} x_1 + \cdots + \beta_{pl} x_p}}$

• Here there is a linear function for each class.
• Only $K - 1$ linear functions are needed, as in 2-class logistic regression.
• Multiclass logistic regression is also referred to as multinomial regression.
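A sketch of computing these class probabilities (the softmax of one linear score per class); the coefficients are illustrative:

```python
import numpy as np

def softmax_probs(x, B):
    """Pr(Y = k | X = x) for multinomial regression.
    x: feature vector of length p; B: (K, p + 1) array of rows [beta_0k, beta_1k, ..., beta_pk]."""
    scores = B @ np.concatenate(([1.0], x))   # one linear function per class
    scores -= scores.max()                     # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()

B = np.array([[ 0.2,  1.0, -0.5],   # illustrative coefficients, K = 3 classes, p = 2
              [-0.1,  0.3,  0.8],
              [ 0.0, -1.2,  0.4]])
print(softmax_probs(np.array([0.5, -1.0]), B))   # three probabilities summing to 1
```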


Bayes' theorem
• Bayes' theorem is stated as the following equation:

  $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$

  where $A$ and $B$ are events, and $P(B) \neq 0$.
• $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ without regard to each other.
• $P(A \mid B)$, a conditional probability, is the probability of observing event $A$ given that $B$ is true.
• $P(B \mid A)$ is the probability of observing event $B$ given that $A$ is true.
• Bayes' theorem is the key to using new observations to modify prior beliefs.
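A small worked example (all numbers are made up for illustration): a condition with 1% prevalence and a test with a 95% true-positive rate and a 10% false-positive rate.

```python
# P(condition | positive test) via Bayes' theorem (illustrative numbers)
p_cond = 0.01                  # prior P(A)
p_pos_given_cond = 0.95        # likelihood P(B | A)
p_pos_given_healthy = 0.10     # false-positive rate

# Marginal P(B): probability of a positive test under both hypotheses
p_pos = p_pos_given_cond * p_cond + p_pos_given_healthy * (1 - p_cond)
posterior = p_pos_given_cond * p_cond / p_pos
print(f"P(condition | positive) = {posterior:.3f}")   # about 0.088
```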


Bayes' theorem ($H$: hypothesis, $e$: evidence)

  $P(H \mid e) = \dfrac{P(e \mid H)\, P(H)}{P(e)}$

• Posterior probability $P(H \mid e)$ – how probable is our hypothesis given the observed evidence? (Not directly computable.)
• Likelihood $P(e \mid H)$ – how probable is the evidence given that our hypothesis is true?
• Prior probability $P(H)$ – how probable was our hypothesis before observing the evidence?
• Marginal likelihood $P(e) = \sum_i P(e \mid H_i)\, P(H_i)$ – how probable is the new evidence under all possible hypotheses?

Posterior probability ∝ Prior probability × Likelihood


Bayes – understanding the posterior

How the prior combines with the evidence is reflected in the posterior.


Naïve Bayes model (typically used for classification)

• We take an instance for which we have observed a number of features $x_1, \ldots, x_n$.
• Goal: infer to which class the particular instance belongs.
• Assumption: every pair of features $x_i, x_j$ is conditionally independent given the class:

  $x_i \perp x_j \mid C$ for all $x_i, x_j$

[Diagram: the class $C$ is a hidden node whose children are the observed features $x_1, x_2, \ldots, x_n$.]

  $P(C \mid x_1, \ldots, x_n) = \dfrac{P(C)\, P(x_1, \ldots, x_n \mid C)}{P(x_1, \ldots, x_n)}$

• The distribution assumed for $P(x_i \mid C)$ gives the different Naïve Bayes models, e.g. Bernoulli, Multinomial, etc.
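A sketch using scikit-learn's Gaussian Naïve Bayes, which assumes each $P(x_i \mid C)$ is a univariate normal (synthetic data, for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(5)
# Two classes whose features have different means
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[2.0, 1.5], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

nb = GaussianNB().fit(X, y)
print(nb.predict_proba([[1.0, 1.0]]))   # P(C | x1, x2) for a new instance
```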


Discriminant Analysis
• Here the approach is to model the distribution of 𝑋 in each of the classes separately, and
then use Bayes theorem to flip things around and obtain Pr(𝑌|𝑋).
• When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic
discriminant analysis.
• However, this approach is quite general, and other distributions can be used as well. We will
focus on normal distributions.


Bayes theorem for discriminant analysis


• Bayes theorem can be used to estimate $\Pr(Y \mid X)$ as follows:

  $\Pr(Y = k \mid X = x) = \dfrac{\Pr(X = x \mid Y = k)\, \Pr(Y = k)}{\Pr(X = x)}$

• One writes this slightly differently for discriminant analysis:

  $\Pr(Y = k \mid X = x) = \dfrac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$

• $f_k(x) = \Pr(X = x \mid Y = k)$ is the density for $X$ in class $k$. Here we will use normal densities for these, separately in each class.
• $\pi_k = \Pr(Y = k)$ is the marginal or prior probability for class $k$.
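A sketch of this posterior for a one-dimensional, two-class problem with normal class densities (all parameters illustrative; a shared variance corresponds to linear discriminant analysis):

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.7, 0.3])     # pi_k
means = np.array([-1.0, 1.5])     # class-specific means of the normal densities f_k
sd = 1.0                          # shared standard deviation

def posterior(x):
    """Pr(Y = k | X = x) = pi_k f_k(x) / sum_l pi_l f_l(x)."""
    densities = norm.pdf(x, loc=means, scale=sd)
    weighted = priors * densities
    return weighted / weighted.sum()

print(posterior(0.0))   # probabilities over the two classes; they sum to 1
```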


Classify to the highest density

• We classify a new point according to which density is highest.


• When the priors are different, we take them into account as well, and compare 𝜋𝑘 𝑓𝑘 𝑥 . On
the right, we favour the pink class – the decision boundary has shifted to the left.

Why discriminant analysis?


• When the classes are well-separated, the parameter estimates for the logistic regression
model are surprisingly unstable. Linear discriminant analysis does not suffer from this
problem.
• If 𝑛 is small and the distribution of the predictors 𝑋 is approximately normal in each of the
classes, the linear discriminant model is again more stable than the logistic regression
model.
• Linear discriminant analysis is popular when we have more than two response classes,
because it also provides low-dimensional views of the data.


Other forms of discriminant analysis

• When 𝑓𝑘 𝑥 are Gaussian densities, with the same covariance matrix in each class, this
leads to linear discriminant analysis.
• By altering the forms for 𝑓𝑘 𝑥 , we get different classifiers.
• With Gaussians but different 𝚺𝑘 in each class, we get quadratic discriminant analysis.
• With $f_k(x) = \prod_{j=1}^{p} f_{jk}(x_j)$ (a conditional independence model) in each class we get naive Bayes. For Gaussians this means the $\Sigma_k$ are diagonal.
• Many other forms, by proposing specific density models for 𝑓𝑘 𝑥 , including
nonparametric approaches.


Logistic regression, LDA, and Naïve Bayes


• Logistic regression is very popular for classification, especially when K = 2.
• LDA is useful when n is small, or the classes are well separated, and Gaussian assumptions
are reasonable. Also when K > 2.
• Naïve Bayes is useful when p is very large.


GAMs for classification


• As in regression models, we can build non-linear classification models using basis functions
𝑓𝑘 (𝑋𝑗 ) and add them together to make predictions: generalised additive models (GAMs) for
classification.


k-Nearest neighbour (kNN)


• It is a memory-based classification method (i.e. it requires no model to be fit).

• Algorithm:
  1. Given a query point $x_0$, find the $k$ training points $x_{(r)}, r = 1, \ldots, k$, closest in distance to $x_0$.
  2. Classify using a majority vote among the $k$ neighbours.
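A minimal sketch of this algorithm in NumPy (Euclidean distance, majority vote; data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(x0, X_train, y_train, k=7):
    """Classify query point x0 by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)    # Euclidean distances to x0
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

rng = np.random.default_rng(6)
X_train = rng.normal(size=(50, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
print(knn_predict(np.array([0.5, 0.2]), X_train, y_train, k=7))
```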


Decision boundaries

• The k-NN algorithm does not explicitly compute decision boundaries.
• The more examples that are stored, the more complex the decision boundaries can become.
• k-NN heavily suffers from the curse of dimensionality:
  • Suppose we have 5000 points uniformly distributed in the unit hypercube and we want to apply 5-NN.
  • Suppose our query point is at the origin:
    • 1D – on a one-dimensional line, we must go a distance of 5/5000 = 0.001 on average to capture the 5 nearest neighbours.
    • 2D – in two dimensions, we must go $0.001^{1/2} \approx 0.032$ to get a square that contains 0.001 of the volume.
    • pD – in $p$ dimensions, we must go $0.001^{1/p}$!

[Figure: decision boundary of a 2-class, 2-dimensional problem using 7-NN.]
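This edge-length calculation is easy to check numerically:

```python
# Edge length of a hypercube (anchored at the query point) that contains a
# fraction 0.001 of the unit hypercube's volume, for increasing dimension p
fraction = 5 / 5000
for p in (1, 2, 3, 10, 100):
    print(f"p = {p:>3}: edge length = {fraction ** (1 / p):.3f}")
# p=1: 0.001, p=2: 0.032, p=10: 0.501, p=100: 0.933 -> "nearest" neighbours are no longer local
```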

K-Nearest Neighbours

• Advantages:
  • Simple technique that is easily implemented.
  • Building the model is inexpensive.
  • Extremely flexible classification scheme that does not require preprocessing.
  • Well suited for multi-modal classes (classes of multiple forms) and for records with multiple class labels.
  • Asymptotic error rate is at most twice the Bayes rate (Cover & Hart, 1967).
  • Can sometimes be the best method.

• Disadvantages:
  • Classifying unknown records is relatively expensive: it requires computing distances to the k nearest neighbours and is computationally intensive, especially as the training set grows.
  • Accuracy can be severely degraded by the presence of noisy or irrelevant features.
  • NN classification expects the class conditional probability to be locally constant, and suffers from bias in high dimensions.


Summary
1. We learnt about regression models and several regression methods: simple and
multiple linear regression, and non-linear regression (polynomial, splines, GAMs)

2. We learnt about classification models and several classification methods: simple


and multiple logistic regression, discriminant analysis, GAMs, Naïve Bayes, and K-
nearest neighbour.

3. We also learnt how to interpret the coefficients of linear and logistic regression
models.
