First Course 2
algorithms.
- More data means more learning material for AI models.
[Figure: bar chart, y-axis in billion USD]
- We also observe a certain outcome variable for each subject, say using a certain product. This is called the target variable.
- The set of these observations (features and target) is called the learning data.
- Such an algorithm «learns» from our observed data and then we use it to make predictions about new cases.
[Figure: scatter plots of subjects by Age and SES]
Issue 1: Too Many Options
[Figure: two Age vs. SES plots]
Solve Issue 1: Restrict
In order to find a solution, our algorithms need to be restricted to a certain class of possible solutions.
Issue 2: The Best Solution
Within the restricted class, we still need to find an optimal solution.
«If we have data, let’s look at data. If all we have are opinions, let’s go with mine.»
Jim Barksdale
We simply cannot expect to obtain a reliable solution to every problem just by collecting data.
A Machine Learning Example
The CIFAR-10* dataset consists of 60000 32x32 color images in 10 classes, with
6000 images per class.
*Alex Krizhevsky, Learning multiple layers of features from tiny images, Tech. report, 2009.
A Machine Learning Example
Denote the images in the dataset by x (input) and their class labels by y (output).
Training data
{(xi, yi): i = 1,…, n}
Test data
(x0, y0)
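To make the notation concrete, here is a minimal sketch that loads CIFAR-10 with its standard training/test split. It assumes the tensorflow package (with its bundled Keras datasets) is installed; the slides do not prescribe a particular library.

# Minimal sketch: load CIFAR-10 and inspect its training/test split.
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

print(x_train.shape)  # (50000, 32, 32, 3): the training images x_i
print(y_train.shape)  # (50000, 1): the training labels y_i in {0, ..., 9}
print(x_test.shape)   # (10000, 32, 32, 3): held-out test images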
Supervised Learning
Customers      Premiums
Customer 1     $$
Customer 2     $
…              …
Customer n     $$$
Regression
Training data
{(xi, yi): i = 1,…, n}
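As a minimal sketch of this regression setup (the customer features and premium amounts below are made up purely for illustration; scikit-learn is one library choice, not prescribed by the slides):

# Regression sketch: predict a numeric premium from customer features.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[25, 1], [40, 3], [55, 2]])   # hypothetical features, e.g. age and years as customer
y = np.array([300.0, 550.0, 700.0])         # hypothetical premiums

model = LinearRegression().fit(X, y)
print(model.predict([[35, 2]]))             # predicted premium for a new customer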
Supervised Learning
Customers      Claims
Customer 1     Valid
Customer 2     Fraud
…              …
Customer n     Valid
Classification
Training data
{(xi, yi): i = 1,…, n}
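A matching classification sketch (again with made-up claim data; scikit-learn's LogisticRegression is just one possible classifier):

# Classification sketch: predict a label ("valid" / "fraud") for a claim.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[200, 1], [5000, 0], [150, 1], [7000, 0]])  # hypothetical features, e.g. amount, has receipts
y = np.array(["valid", "fraud", "valid", "fraud"])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[6000, 0]]))   # predicted class for a new claim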
Unsupervised Learning
Training data
{xi : i = 1,…, n}
Test data
x0
Unsupervised Learning
Customers      Segmentation
Customer 1     ?
Customer 2     ?
…              …
Customer n     ?
Clustering
Training data
{xi: i = 1,…, n}
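A minimal clustering sketch (k-means on made-up customer features; note that no target y_i appears anywhere):

# Clustering sketch: segment customers without any target variable.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[20, 1000], [22, 1200], [45, 9000], [47, 8800]])  # hypothetical features, e.g. age and yearly spend

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster (segment) assigned to each customer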
Expectation from Learning
Prediction: What?
Inference: How?
The Problem of Learning
Input → Unknown Function f : X → Y → Output
Can we approximate f?
Result: the learning algorithm produces f̂ : X → Y, x ↦ f̂(x), with f̂(x) ≈ f(x).
Hypothesis Set:
• Impossible to consider all possible functions
• Considering all possible functions may not give the best result
• Therefore, we will choose a restricted set H
[Figure: bias-variance diagram of possible outcomes for f̂ (low/high bias, variance)]
Possible outcomes depend on the hypothesis set, the training data, and the training algorithm.
How flexible should the model be?
Aim: to find a polynomial P(x) = cₙxⁿ + ⋯ + c₁x + c₀ that fits the data.
[Figure: polynomial fits of degree n = 10 and n = 15]
This data came from a function of the form Y = X² + ε.
With new data, compare the performance of n = 2 vs. n = 10 and n = 2 vs. n = 15.
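A minimal sketch of this experiment, simulating data from Y = X² + ε and comparing polynomial fits of different degrees on fresh data (the sample sizes and noise level are illustrative):

# Fit polynomials of degree 2, 10 and 15 to noisy data from Y = X^2 + eps,
# then compare their mean squared errors on new data from the same function.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-2, 2, 30)
y_train = x_train**2 + rng.normal(scale=0.3, size=x_train.size)

x_new = np.linspace(-2, 2, 200)
y_new = x_new**2 + rng.normal(scale=0.3, size=x_new.size)

for degree in (2, 10, 15):
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x_new) - y_new)**2)
    print(degree, mse)   # the degree-2 fit typically generalizes best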
Gönenc Onay
November 8, 2023
Introduction to Linear Regression
$\hat{f} : \mathbb{R} \to \mathbb{R}, \quad x \mapsto \hat{\beta}_1 x + \hat{\beta}_0$
Features and Targets
Feature x_i    Target y_i
x_1            y_1
x_2            y_2
...            ...
x_n            y_n
Figure: Data representation in Linear Regression
Fitting by Optimization (OLS)
The objective in linear regression is to minimize the sum of the squared residuals:
$$J(\beta_0, \beta_1) = \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$
The convexity of $J$ guarantees that this minimization has a unique solution.
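A minimal sketch of minimizing J, using the least-squares solver in NumPy on simulated data (the true coefficients below are invented for illustration):

# OLS sketch: estimate beta_0 and beta_1 by minimizing the sum of squared residuals.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)   # true intercept 2, true slope 3

X = np.column_stack([np.ones_like(x), x])            # design matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares solution

print(beta_hat)   # estimates of (beta_0, beta_1), close to (2, 3)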
Residuals
$e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$
Assumptions of Linear Regression and OLS Unbiasedness
$e_i \sim N(0, \sigma^2)$
No systematic relationship: the residuals $e_i$ show no pattern with respect to $\hat{y}_i$ or $x_i$.
Hence the OLS estimates of the coefficients are unbiased.
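A small sketch of checking these assumptions on the residuals of a fitted line (the data is simulated and the checks shown are illustrative, not exhaustive):

# Residual diagnostics sketch: inspect the residuals e_i of a fitted line.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(scale=0.8, size=100)

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

print(residuals.mean())                 # should be close to 0
print(stats.shapiro(residuals).pvalue)  # rough check of e_i ~ N(0, sigma^2)
# In practice one also plots residuals against fitted values and against x_i
# to look for systematic patterns.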
Multiple Linear Regression
$\hat{y} = \hat{\beta}_0 + \hat{\beta} \cdot x$
Where:
β̂ is a vector of coefficients.
x is the feature vector.
The dot product ensures that each feature is appropriately weighted.
The dimensionality of features can be large in many datasets.
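A minimal multiple-regression sketch on simulated data with two features (the coefficients are invented; scikit-learn is one convenient choice):

# Multiple linear regression sketch: y = beta_0 + beta . x with a two-dimensional x.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))                        # feature vectors x
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # estimates of beta_0 and the coefficient vector beta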
Feature Selection and R² Test
Feature Selection, R² Test, and Collinearity
Feature Selection and R²:
Including more features can artificially inflate the R² value, making the model appear to fit better than it does in practice.
By selecting only relevant features, we can obtain a more genuine R² value that reflects the true explanatory power of our model.
However, overzealous removal of features might decrease the R² value if we remove genuinely informative predictors.
Collinearity:
When features are correlated (collinear), it can be challenging for the
model to determine the individual influence of each feature.
This can result in unstable coefficient estimates.
Solutions:
Variance Inflation Factor (VIF): a measure to detect the presence of collinearity. A VIF > 10 is typically considered high (see the sketch after this list).
Principal Component Analysis (PCA): A technique to transform and
reduce the dimensionality of the data.
Regularization techniques (like Ridge or Lasso Regression) can help in
managing collinearity.
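As referenced above, a minimal sketch of computing VIFs with statsmodels (the two strongly collinear features are simulated for illustration):

# VIF sketch: detect collinearity among features.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.05, size=300)   # nearly identical to x1: strong collinearity
x3 = rng.normal(size=300)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):                   # skip the constant column
    print(i, variance_inflation_factor(X, i))    # VIF well above 10 for the collinear pair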
Python Tools for Assessing Linear Regression
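As one possible illustration of such tools (a minimal sketch with statsmodels, whose OLS summary reports coefficients, standard errors, R², and related diagnostics; the choice of library is an assumption, not prescribed by the slide):

# Fit OLS with statsmodels and print its diagnostic summary on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

X = sm.add_constant(x)             # add the intercept column
results = sm.OLS(y, X).fit()
print(results.summary())           # coefficients, standard errors, R^2, F-statistic, ...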
Non-linear Dependencies and Polynomial Regression
Considerations for Polynomial Regression
Feature Engineering for Non-linear Regression
Polynomial Features:
These are powers of the original features.
Python’s sklearn.preprocessing.PolynomialFeatures can generate these (see the sketch after this list).
Use Case: Curve fitting where the relationship between variables
exhibits a polynomial nature.
Gaussian (Radial Basis Function) Features:
Transforms each feature into a Gaussian.
Allows linear models to capture non-linear relationships by implicitly
mapping data to a high-dimensional space.
When combined with linear models, this enables them to handle non-linear patterns.
Use Case: RBF networks in regression or RBF kernel in SVMs to
capture localized or complex trends (we will see later).
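A minimal sketch of both transformations feeding an ordinary linear model (the non-linear target, the polynomial degree, and the hand-rolled Gaussian basis with its centres and width are all illustrative choices):

# Polynomial features and Gaussian (RBF) features feeding a linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(-3, 3, size=100)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=100)    # clearly non-linear target

# Polynomial features: columns [1, x, x^2, x^3]
X_poly = PolynomialFeatures(degree=3).fit_transform(x)
print(LinearRegression().fit(X_poly, y).score(X_poly, y))  # R^2 of the polynomial fit

# Gaussian (RBF) features: one bump exp(-(x - c)^2 / (2 s^2)) per centre c
centres = np.linspace(-3, 3, 10)
X_rbf = np.exp(-(x - centres) ** 2 / (2 * 0.5 ** 2))       # shape (100, 10) by broadcasting
print(LinearRegression().fit(X_rbf, y).score(X_rbf, y))    # R^2 of the RBF-feature fit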
Summary