First Course 2


Big Data

- Data is the learning material for AI algorithms.

- More data means more learning material for AI models.

- Big data thus enables more consistent and accurate AI models.

[Figure: Big Data market share in billion USD, growing from 2011 to 2018.]
Processing Power

Despite speculation to the contrary, processing power has kept increasing exponentially.
Theoretical Paradigm Shift
In the earlier days of AI, one needed a deep understanding of the problem
in mathematical terms in order to reach useful conclusions using AI
models.

Deep Blue vs Kasparov - 1997

Deep Blue (and its successors) ran on an algorithm which uses a precise measure of «how good» a given position is in a game of chess.

Once you can measure how good positions are, it is not very difficult to select moves which lead toward better positions.
Theoretical Paradigm Shift
Modern AI algorithms rely on «statistical learning» which, at least in
certain instances, can overcome the lack of mathematical understanding
of a problem.

Go

Go has long been known to be notoriously difficult for mathematical analysis.

In the 1970s, AI researchers thought computers could not possibly beat the best human players.
Theoretical Paradigm Shift
Statistical learning, expressed through a specific class of models, «deep neural networks», has proven to be extremely powerful, as demonstrated by AlphaGo's decisive defeat of Lee Sedol.

AlphaGo vs Lee Sedol - 2016

Lee Sedol acknowledged that winning one game against AlphaGo was a big accomplishment.

It is now deemed impossible for humans to win a matchup against AlphaGo.
Typical Statistical Learning Setup *

- We have two measurements on our subjects, say socioeconomic status (SES) and age. These are called features.

- We also observe a certain outcome variable for each subject, say whether they use a certain product. This is called the target variable.

- The set of these observations (features and target) is called the learning data.

[Figure: scatter plot of the subjects in the SES-Age plane, coloured by the target variable.]

* Also called Supervised Learning


Typical Statistical Learning Setup

- An algorithm which gives a procedure to distinguish the blue and red points by looking at the SES and Age variables is called an AI model.

- Such an algorithm «learns» from our observed data, and we then use it to make predictions about new cases.

[Figure: the same SES-Age scatter plot of the learning data.]
Issue 1: Too Many Options

[Figure: two SES-Age plots illustrating different possible solutions.]
Solve Issue 1: Restrict
In order to find a solution, our algorithms need to be restricted to a certain class of possible solutions.
Issue 2: The Best Solution

- After we decide on the method (random forests and neural networks are the primary choices in modern applications), we need to find an optimal solution.

- For example, if we decide to solve this problem with a line, we want to find the line which gives the minimum classification error.

- Gradient boosting and gradient descent are examples of optimization techniques used in modern AI applications (a minimal gradient-descent sketch follows below).

[Figure: SES-Age plot with a candidate line separating the two classes.]
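To make the optimization step concrete, here is a minimal sketch (not from the slides) of gradient descent fitting a line by minimizing a squared-error objective; the toy data, learning rate, and iteration count are assumptions made purely for illustration, and squared error is used instead of classification error to keep the example short.

```python
import numpy as np

# Toy data: one feature (e.g. SES) and a numeric target.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)

# Model: y_hat = w * x + b. Objective: mean squared error.
w, b = 0.0, 0.0
lr = 0.1                                     # learning rate (arbitrary choice)
for _ in range(2000):
    y_hat = w * x + b
    grad_w = 2 * np.mean((y_hat - y) * x)    # dJ/dw
    grad_b = 2 * np.mean(y_hat - y)          # dJ/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # should end up close to 2.0 and 1.0
```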
Issue 3: Data Availability

«If we have data, let’s look at data. If all we have are opinions, let’s go with mine.»
Jim Barksdale

Statistical learning is based on the assumption that we do have actual data collected from real samples. No data often means no AI.
Issue 4: The Problem Itself

We simply cannot expect to obtain a reliable solution to every problem by collecting data.
A Machine Learning Example

The CIFAR-10* dataset consists of 60000 32x32 color images in 10 classes, with
6000 images per class.

Image from www.tensorflow.org

*Alex Krizhevsky, Learning multiple layers of features from tiny images, Tech. report, 2009.
A Machine Learning Example
Denote the images in the dataset by xᵢ, i = 1, …, n.

Denote the labels by yᵢ.


Machine Learning

Input* → Output#

Supervised Learning
Unsupervised Learning

* independent variable; predictor; feature
# dependent variable; target value
Supervised Learning

Training data
{(xi, yi): i = 1,…, n}

Test data
(x0, y0)
Supervised Learning
Customers → Premiums
Customer 1 → $$
Customer 2 → $
Customer n → $$$

Regression
Training data
{(xi, yi): i = 1,…, n}
Supervised Learning
Customers → Claims
Customer 1 → Valid
Customer 2 → Fraud
Customer n → Valid

Classification
Training data
{(xi, yi): i = 1,…, n}
Unsupervised Learning

Training data
{xi : i = 1,…, n}

Test data
x0
Unsupervised Learning
Customers Segmentation

Customer 1

Customer 2

Customer n

Clustering
Training data
{xi: i = 1,…, n}
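To tie these setups together, the following is a small illustrative sketch (not part of the slides) using standard scikit-learn estimators; the toy data, features, and model choices are assumptions made just to contrast regression, classification, and clustering.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # features, e.g. (SES, Age)

# Regression: numeric target (e.g. premium amount).
y_reg = 3 * X[:, 0] + rng.normal(size=100)
reg = LinearRegression().fit(X, y_reg)

# Classification: categorical target (e.g. valid vs fraud).
y_clf = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_clf)

# Clustering: no target at all, only the features.
clu = KMeans(n_clusters=3, n_init=10).fit(X)

x0 = np.array([[0.5, -0.2]])             # a new "test" observation
print(reg.predict(x0), clf.predict(x0), clu.predict(x0))
```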
Expectation from Learning

Prediction → What?

Inference → How?
The Problem of Learning

Input → Unknown Function → Output

The output is modeled as Y = f(X) + ε, where f is the unknown function we wish to approximate and ε is a random error term (independent of the input).
The Problem of Learning
Target Function: f : X → Y, x ↦ f(x)
Training Data: {(xᵢ, yᵢ)}
Learning Algorithm → Result: f̂ : X → Y with f̂(x) ≈ f(x)

Hypothesis Set H:
• Impossible to consider all possible functions
• Considering all possible functions may not give the best result
• Therefore, we will choose a restricted set H

Yaser Abu-Mostafa, Caltech Course
https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
The Problem of Learning
[Figure: bias-variance diagram showing the target f, the hypothesis set, and the learned f̂; possible outcomes range over low/high bias and low/high variance, depending on the hypothesis set, the training data, and the training algorithm.]
How flexible should the model be?
Aim: to find a function which
• models the data best, and
• performs well on new data.

Let’s try a polynomial:

P(x) = cₙxⁿ + ⋯ + c₁x + c₀

What should n be? Large or small?

[Figure: polynomial fits for n = 1, 2, 5, 10, 15.]
This data came from a function of the form Y = X² + ε.
With new data, compare the performance of n = 2 vs n = 10 or n = 15.

[Figure: out-of-sample comparisons, n = 2 vs n = 10 and n = 2 vs n = 15.]

Overfitting for n = 10, 15!

As n increases, variance increases and bias decreases (n ↑ ⇒ Variance ↑, Bias ↓).
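As an illustration of this trade-off, here is a minimal numpy sketch (not from the slides) that fits polynomials of degree 2 and 15 to data generated from Y = X² + ε and compares their errors on fresh data; the sample sizes and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    return x, x**2 + rng.normal(0, 0.1, n)      # Y = X^2 + eps

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)                  # new, unseen data

for degree in (2, 15):
    coeffs = np.polyfit(x_train, y_train, degree)          # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_mse, test_mse)
# Typically: degree 15 has a lower training error but a higher test error
# than degree 2 -- overfitting.
```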
Linear Regression

Gönenc Onay

November 8, 2023

Introduction to Linear Regression

Linear Regression is a supervised learning algorithm.

It attempts to model the relationship between a dependent variable and one or more independent variables.

In simple linear regression the goal is to fit a line to noisy data, i.e. to labelled tuples (xᵢ, yᵢ) assumed to be of the form yᵢ := f(xᵢ) + εᵢ, i = 1, …, m, with f(x) = ax + b.

We try to learn f, so as to have a prediction function

f̂ : ℝ → ℝ, x ↦ β̂₁x + β̂₀,

with f̂ ≈ f (to be made precise below).
Features and Targets

Features: the input (independent) variables. Denoted xᵢ.

Targets: the dependent variables. Denoted yᵢ.

Feature xᵢ | Target yᵢ
x₁         | y₁
x₂         | y₂
⋮          | ⋮
xₙ         | yₙ

Figure: Data representation in Linear Regression
Fitting by Optimization (OLS)
The objective in linear regression is to minimize the sum of the
squared residuals:
J(β₀, β₁) = Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))²

This is known as the Least Squares Cost Function.


By minimizing this function, we find the best-fitting line.

Analytic Solution: Normal Equations


Given a matrix of features X (augmented with a column of ones for the
intercept) and a vector of target values y , the solution is:

β̂ = (XᵀX)⁻¹Xᵀy = (β̂₀, β̂₁).

The convexity of the function J guarantees that this solution is unique.
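As a quick check of the normal equations, here is a minimal numpy sketch (an illustration, not course code); the simulated slope, intercept, and noise level are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 1.0, 50)    # y_i = a*x_i + b + eps_i

# Augment the features with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)            # approximately [1.0, 2.5] = (beta0_hat, beta1_hat)

residuals = y - X @ beta_hat
```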
Residuals

Suppose we have fitted our model, that is, we have obtained f̂(x) := β̂₁x + β̂₀, with fitted values ŷᵢ = f̂(xᵢ).

A residual is the difference between the observed value and the predicted value. Mathematically, for each data point i:

eᵢ = yᵢ − (β̂₀ + β̂₁xᵢ) = yᵢ − ŷᵢ

Note that eᵢ ≠ εᵢ in general! BUT the best we can hope for concerns the expected values:
Assumptions of Linear Regression and OLS Unbiasedness

1 Normally distributed residuals:

eᵢ ∼ N(0, σ²)

That is, with constant variance σ² (homoscedasticity) and zero mean: E(eᵢ) = 0.

2 No discernible pattern in the residuals: residuals should appear random when plotted against the predicted values or any predictors (no systematic relationship between eᵢ and ŷᵢ or xᵢ).

If these assumptions hold, OLS is an unbiased estimator, i.e.

E(β̂₁) = a and E(β̂₀) = b.

Hence E(f̂(x)) = f(x) for every x.
Multiple Linear Regression

Multiple Linear Regression (MLR) extends simple linear regression to model the relationship between multiple independent variables and a dependent variable.

The model we want to fit is represented as:

y = β̂₀ + β̂ · x

where:
β̂ is a vector of coefficients,
x is the feature vector.

The dot product ensures that each feature is appropriately weighted.
The dimensionality of features can be large in many datasets.
Feature Selection and R² Test

Feature Selection: not all features may be informative. Reducing the number of features can:
Improve model interpretability.
Reduce overfitting.
Enhance computational efficiency.

The R² value, or coefficient of determination, measures the proportion of the variance in the dependent variable that is predictable from the independent variables:

R² = 1 − SS_res / SS_tot

where SS_res is the sum of squares of the residuals, and SS_tot is the total sum of squares.
An R² value close to 1 indicates that the model explains a large portion of the variance in the response.
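R² can be computed directly from this definition; below is a self-contained sketch (an illustration, not course code) on made-up, nearly linear data.

```python
import numpy as np

# Toy data for illustration (values are arbitrary).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 1.0, 50)

# Fit the simple regression line by least squares.
beta1, beta0 = np.polyfit(x, y, 1)       # returns [slope, intercept]
y_hat = beta0 + beta1 * x

ss_res = np.sum((y - y_hat) ** 2)        # SS_res: sum of squared residuals
ss_tot = np.sum((y - y.mean()) ** 2)     # SS_tot: total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)                         # close to 1 for this nearly linear data
```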

Feature Selection, R² Test, and Collinearity
Feature Selection and R²:
Including more features can artificially inflate the R² value, making the model appear to fit better than it does in practice.
By selecting only relevant features, we can obtain a more genuine R² value that reflects the true explanatory power of our model.
However, overzealous removal of features might decrease the R² value if we remove genuinely informative predictors.
Collinearity:
When features are correlated (collinear), it can be challenging for the
model to determine the individual influence of each feature.
This can result in unstable coefficient estimates.
Solutions:
Variance Inflation Factor (VIF): A measure to detect the presence of
collinearity. A VIF > 10 is typically considered high.
Principal Component Analysis (PCA): A technique to transform and
reduce the dimensionality of the data.
Regularization techniques (like Ridge or Lasso Regression) can help in
managing collinearity.
Python Tools for Assessing Linear Regression

1 qqplot from statsmodels:
Helps in visualizing if the residuals follow a normal distribution.
Deviations from the straight line indicate departures from normality.
2 VIF from statsmodels:
The Variance Inflation Factor assesses multicollinearity.
Values > 10 suggest high collinearity and potential issues.
3 scatter_matrix from pandas.plotting:
Visualizes pairwise relationships in the dataset.
Useful for quickly spotting linear or non-linear patterns, potential
outliers, or feature relationships.
4 residplot from seaborn:
Plots residuals against fitted values.
Helps in identifying non-linear patterns, heteroscedasticity, or other
issues in residuals.
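Below is a minimal sketch (not from the slides) of how these four tools might be wired together on one fitted model; the toy DataFrame, column names, and the degree of collinearity are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas.plotting import scatter_matrix
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy dataset with two (correlated) features -- purely illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = 0.8 * df["x1"] + rng.normal(scale=0.5, size=200)
df["y"] = 1.0 + 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=200)

# Fit an OLS model with statsmodels.
X = sm.add_constant(df[["x1", "x2"]])
model = sm.OLS(df["y"], X).fit()

# 1) qqplot: are the residuals approximately normal?
sm.qqplot(model.resid, line="45", fit=True)

# 2) VIF: how collinear are the predictors?
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, vif)))

# 3) scatter_matrix: pairwise relationships in the raw data.
scatter_matrix(df)

# 4) residplot: residuals against fitted values.
sns.residplot(x=model.fittedvalues, y=model.resid)
plt.show()
```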

Non-linear Dependencies and Polynomial Regression

Linear regression assumes a linear relationship between the dependent and independent variables. However, real-world data can often exhibit non-linear patterns.

Polynomial Regression allows us to model these non-linear relationships.
It is a special case of multiple linear regression, where we model the relationship using a polynomial function of the predictors.

The model we want to fit can be represented as:

y = β̂₀ + β̂₁x + β̂₂x² + ⋯ + β̂ₙxⁿ
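One way to fit such a model, sketched below with scikit-learn (an illustration under assumed toy data, not the course's own code), is to expand the feature with PolynomialFeatures and then run ordinary linear regression on the expanded features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (100, 1))
y = x[:, 0] ** 2 + rng.normal(0, 0.1, 100)    # noisy quadratic data

# Degree-2 polynomial regression as a pipeline:
# x -> [1, x, x^2] -> ordinary least squares.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[0.5]])))       # should be near 0.25
```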

Considerations for Polynomial Regression

Overfitting: As the degree of the polynomial increases, the model becomes more flexible and can fit noise in the data. Regularization techniques (like Ridge or Lasso that we will see later) can help.
Computational Complexity: Higher-degree polynomials increase the
number of features significantly.
Feature Scaling: It’s crucial in polynomial regression since
higher-degree terms can have much larger values.
Always visualize the fit and validate with out-of-sample data to ensure
the model captures the underlying pattern without overfitting.

Feature Engineering for Non-linear Regression

Polynomial Features:
These are powers of the original features.
Python’s ‘sklearn.preprocessing.PolynomialFeatures‘ can generate
these.
Use Case: Curve fitting where the relationship between variables
exhibits a polynomial nature.
Gaussian (Radial Basis Function) Features:
Transforms each feature into a Gaussian.
Allows linear models to capture non-linear relationships by implicitly
mapping data to a high-dimensional space.
When combined with linear models, this enables them to handle non-linear patterns.
Use Case: RBF networks in regression or RBF kernel in SVMs to
capture localized or complex trends (we will see later).
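As a rough illustration of Gaussian features (the centres, width, and toy data below are made-up choices, not prescribed by the slides), one can hand-roll an RBF expansion and feed it to a linear model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rbf_features(x, centers, width=0.2):
    """Map a 1-D feature to Gaussian bumps centred at `centers`."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width**2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)    # non-linear target

centers = np.linspace(0, 1, 10)           # 10 evenly spaced Gaussian centres
Phi = rbf_features(x, centers)            # (200, 10) design matrix

model = LinearRegression().fit(Phi, y)    # linear model on non-linear features
x_new = np.array([0.25])
print(model.predict(rbf_features(x_new, centers)))     # roughly sin(pi/2) = 1
```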

Summary

Linear Regression: Supervised method for modeling relationships between a dependent and one or more independent variables.
Assumptions: Residuals should be normally distributed,
homoscedastic, and independent. Feature multicollinearity should be
avoided.
Multiple Linear Regression: Extends simple linear regression for
multiple features; can model polynomial relationships.
Feature Engineering: Tools like polynomial and Gaussian (RBF)
features enable capturing non-linear patterns.
Model Robustness: Tools like ‘qqplot‘, VIF, and ‘residplot‘ help in
assessing model assumptions and suitability.
Considerations: Beware of overfitting, especially with high-degree
polynomials or numerous features. Always validate and visualize the
model fit.
