
1. Introduction

Statistical Digital Signal Processing and Machine Learning
Content
❖ Basic Concepts

❖ Recognition Systems

❖ What is Machine Learning?

❖ Density Estimation, Regression and Interpolation

❖ Application Examples

1. Basic Concepts
❖ Signal: a varying quantity that carries information about a physical
phenomenon/process under analysis.

❖ Monodimensional signal: a function f(ξ) representing the evolution of the
information with respect to an independent variable ξ, which corresponds to a
physical reality such as time, frequency, pressure, etc. (e.g., a speech signal).

❖ Multidimensional signal: the evolution of the information is simultaneously related
to multiple correlated or uncorrelated physical realities, given as g(ξ, γ, δ, …).

1. Basic Concepts
❖ Signal acquisition: to convert a signal from analog to digital, as depicted
below, we need a means of acquiring the physical signal.

❖ Examples of 1-D Signals

❖ Electrocardiographic (ECG) signal

❖ The ECG is acquired using a biopotential amplifier.

1. Basic Concepts

❖ This is a graphical representation of a sound wave recorded by a
microphone.

❖ Speech Signals (Spectrogram)

❖ A time-frequency representation of the speech signal.

❖ A tool for studying speech sounds (phones).

❖ A visual representation of the frequencies of a signal versus time.
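
❖ As an illustration, such a time-frequency representation can be computed with standard signal-processing tools. A minimal sketch in Python (the tone and sampling rate below are hypothetical, just to produce a signal to analyze):

    import numpy as np
    from scipy.signal import spectrogram

    # Hypothetical input: one second of a 440 Hz tone sampled at 8 kHz
    fs = 8000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t)

    # Power at each (frequency, time) bin of the signal
    f, frames, Sxx = spectrogram(x, fs=fs, nperseg=256)
    print(Sxx.shape)  # (number of frequency bins, number of time frames)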


1. Basic Concepts
❖ Hidden Markov Models implicitly model spectrograms in speech-to-text
systems.

❖ Spectrograms are also useful for evaluating text-to-speech systems.

❖ A high-quality text-to-speech system should produce synthesized
speech whose spectrogram closely matches that of the natural
sentences.

1. Basic Concepts
❖ Examples of 2-D Signals

❖ For 8-bit grayscale images, pixel values range from 0 to 255.
1. Basic Concepts
❖ Examples of n-D Signals

1. Basic Concepts
❖ Pattern: a form, template, or model (or, more abstractly, a set of rules)
which can be used to make or to generate objects or parts of an object.

❖ Crystal patterns are represented by 2D/3D structures which can be
described through deterministic rules.

1. Basic Concepts
❖ Examples of Crystal Pattern

1. Basic Concepts
❖ Examples of Texture Pattern

1. Basic Concepts
❖ Examples of ECG Pattern

2. Recognition Systems
❖ Pattern recognition aims at classifying data (patterns) based on a priori
knowledge and statistical information extracted from the patterns.

❖ The patterns to be classified are usually groups of measurements, observations
or features, defining points in an appropriate multidimensional space.

❖ Examples of recognition systems: speech systems, optical character
recognition (OCR), biometric systems, biomedical monitoring devices,
change detection in temporal image sequences, etc.

2. Recognition Systems
❖ Example: Geometric Form Recognition

❖ In this case, the pattern is related to the form of the object.

2. Recognition Systems
❖ Example: Radar Detection

❖ The pattern is an intrinsic structure of the received signal, which allows the
system to infer the presence/absence of a target in it.

2. Recognition Systems
❖ Recognition System: Block Scheme

2. Recognition Systems
❖ Recognition System: Design Phases

2. Recognition Systems
❖ Example: Automatic Fish-Packing Plant

2. Recognition Systems
❖ Pre-processing: apply a segmentation operation in order to isolate the fish
from one another and from the background.

❖ Feature extraction: measure some features or properties of the image
which will help in discriminating the two species of fish considered (e.g., fish
length and width).

❖ Classification: evaluate the evidence presented and make a final decision
as to the species.

2. Recognition Systems
❖ Multisensor Recognition System

❖ In numerous applications, a recognition system needs to analyze complex
phenomena.

❖ To increase the probability of success of the system, one often relies on
the acquisition of information from different sensors (sources).

❖ In these scenarios, the system should be capable of conveniently fusing the
multisensor data in order to optimize its performance.

❖ Fusion may happen at two main levels, namely the data/feature and
decision levels, each raising its own methodological issues.

2. Recognition Systems
❖ Feature-Level Fusion

2. Recognition Systems
❖ Decision-Level Fusion

3. What is Machine Learning?
❖ “Learning is any process by which a system improves performance from
experience.” - Herbert Simon

❖ Definition by Tom Mitchell (1998):

❖ Machine Learning is the study of algorithms that

❖ improve their performance P

❖ at some task T

❖ with experience E.

❖ A well-defined learning task is given by <P, T, E>.

3. What is Machine Learning?
❖ Machine learning is concerned with the question of how to construct
computer programs that automatically improve with experience.

Traditional Programming vs Machine Learning

❖ Traditional programming: Data + Program → Computer → Output

❖ Machine learning: Data + Output → Computer → Program
3. What is Machine Learning?
❖ A robot driving learning problem:

❖ T: driving on public four-lane highways using vision sensors

❖ P: average distance traveled before an error

❖ E: a sequence of images and steering commands recorded while
observing a human driver

❖ Hand-written words recognizing learning problem

❖ T: Recognizing hand-written words

❖ P: Percentage of words correctly classified

❖ E: Database of human-labeled images of handwritten words


3. What is Machine Learning?
❖ Example:- Suppose your email program watches which emails you do
or do not mark as spam, and based on that learns how to better filter
spam. What is the task T in this setting?

A. Classifying emails as spam or not spam

B. Watching you label emails as spam or not spam

C. The number of emails correctly classified as spam or not spam

D. None of the above, this is not a machine learning problem.

3.1 When Do We Use Machine Learning?
❖ ML is used when:

❖ Human expertise does not exist (navigating on Mars)

❖ Humans can’t explain their expertise (speech recognition)

❖ Models must be customized (personalized medicine)

❖ Models are based on huge amounts of data (genomics)

❖ Learning isn’t always useful:

❖ There is no need to “learn” to calculate payroll
3.1 When Do We Use Machine Learning?
❖ A classic example of a task that requires machine learning: it is very
hard to say what makes a handwritten digit a “2”.

3.1 When Do We Use Machine Learning?
❖ Some more examples of tasks that are best solved by using a learning
algorithm
❖ Recognizing patterns:
❖ Facial identities or facial expressions
❖ Handwritten or spoken words
❖ Medical images
❖ Generating patterns:
❖ Generating images or motion sequences
❖ Recognizing anomalies:
❖ Unusual credit card transactions
❖ Unusual patterns of sensor readings in a nuclear power plant
❖ Prediction:
❖ Future stock prices or currency exchange rates
3.2 Areas of Applications
❖ Web search
❖ Computational electromagnetics
❖ Wireless Communication: Modulation Recognition, Channel State
Information Prediction for 5G Wireless Communications
❖ Finance
❖ E-commerce
❖ Space exploration
❖ Robotics
❖ Information extraction / signal processing
❖ Social networks
❖ Communication systems
❖ These days, machine learning is involved in almost every area.

3.3 Types of Learning
❖ Supervised (inductive) learning

❖ Given: training data + desired outputs (labels)

❖ Unsupervised learning

❖ Given: training data (without desired outputs)

❖ Semi-supervised learning

❖ Given: training data + a few desired outputs

❖ Reinforcement learning

❖ Given: rewards from a sequence of actions

Supervised Learning: Regression
❖ Supervised learning is the most common type of machine learning

❖ Given (x1, y1), (x2, y2), ..., (xn, yn)

❖ Learn a function f(x) to predict y given x

❖ y is real-valued == regression
[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. year, 1970–2020]
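
❖ A trend like the one in this figure can be fit by least squares. A minimal sketch (the numbers below are hypothetical, not the actual sea-ice measurements):

    import numpy as np

    # Hypothetical (year, extent) pairs shaped like the figure above
    years = np.array([1980.0, 1990.0, 2000.0, 2010.0, 2020.0])
    extent = np.array([7.8, 7.5, 6.3, 4.9, 4.0])  # million sq km, illustrative

    # Least-squares line: f(x) = intercept + slope * x predicts y from x
    slope, intercept = np.polyfit(years, extent, deg=1)
    print(slope, intercept)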
Supervised Learning: Classification
❖ Given (x1, y1), (x2, y2), ..., (xn, yn)

❖ Learn a function f(x) to predict y given x


❖ y is categorical == classification
❖ Example: Breast Cancer (Malignant / Benign)

[Figure: tumor size (x) vs. label (y): 0 = benign, 1 = malignant, with a threshold on tumor size separating “predict benign” from “predict malignant”]
Supervised Learning: Classification
❖ x can be multi-dimensional

❖ Each dimension corresponds to an attribute

[Figure: patients plotted by age vs. tumor size; further possible attributes include clump thickness, uniformity of cell size, and uniformity of cell shape]
Unsupervised Learning
❖ Given x1, x2, ..., xn (without labels)

❖ Output: hidden structure behind the x’s

❖ Example: clustering

Unsupervised Learning

[Figures: organizing computing clusters, social network analysis, market segmentation, astronomical data analysis]
Unsupervised Learning
❖ Independent component analysis (ICA): separate a combined signal into its
original sources.
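
❖ A minimal sketch of source separation with FastICA (the mixed signals below are synthetic, constructed just for illustration):

    import numpy as np
    from sklearn.decomposition import FastICA

    # Two hypothetical sources: a sine wave and a square wave
    t = np.linspace(0, 8, 2000)
    s = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]

    # Observe them only through an unknown mixing matrix A
    A = np.array([[1.0, 0.5], [0.5, 2.0]])
    x = s @ A.T

    # Recover the independent sources (up to order and scale)
    s_hat = FastICA(n_components=2, random_state=0).fit_transform(x)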

Unsupervised Learning
❖ Clustering algorithms can be used in different application sectors: the Google
search engine, individual identification using genes, organizing computing
clusters, social network analysis, market segmentation, and astronomical data
analysis.

❖ In a large data center, such an algorithm can be used to figure out which
machines tend to work together, to maximize the efficiency of the center.

❖ Many businesses have huge databases of customer information. Given
such a customer data set, the algorithm automatically groups customers
into different market segments, so that one can sell to each segment more
efficiently, as in the sketch below.
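
❖ A minimal k-means sketch for the market-segmentation case (the customer features and counts are hypothetical):

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer features: [age, annual spend in $]
    customers = np.array([[25, 300], [27, 350], [45, 1200],
                          [47, 1100], [60, 200], [62, 250]], dtype=float)

    # Group the customers into 3 market segments
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
    print(kmeans.labels_)           # segment index for each customer
    print(kmeans.cluster_centers_)  # "average customer" of each segment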

4. Density Estimation, Regression and Interpolation
❖ Statistical classifiers require, implicitly or explicitly, as a first step of the
learning process, the estimation of the density with which a member of a certain
class will be found to express particular features.

❖ Whereas the outputs for classification are discrete class labels, regression is
concerned with the prediction of continuous quantities.

❖ Regression systems seek to find some functional description of the data,
often with the goal of predicting values for new inputs.

❖ Interpolation can be seen as a particular case of regression, since it aims at
inferring a prediction function over specific ranges of input.

5. Application Examples
❖ Remote sensing (e.g., generation of thematic and change maps,
environmental risk assessment);

❖ Target recognition in radar and sonar signals;

❖ Optimal receivers for telecommunication systems;

❖ Industrial applications (e.g., automatic product quality control, testing
and diagnosis systems for industrial machinery);

❖ Speech recognition (e.g., call-centers);

❖ Optical character recognition (OCR);

5. Application Examples
❖ Biomedical signal analysis (e.g., support to diagnosis and monitoring,
telemedicine);

❖ Biometry (person authentication/identification based on digital fingerprints,
iris analysis, …);

❖ Video-surveillance of public and private environments (e.g., airports,
stadiums, parking lots);

❖ Robotics (computer vision);

❖ Bioinformatics (DNA and microarray analysis).

5. Application Examples
❖ Example: Video-Surveillance

❖ An automatic recognition system can support an agent by detecting
(early) warnings of the presence of suspect objects or panic situations,
through acoustic and radiometric sensors placed in different positions.

5. Application Examples

❖ Example: Video-Surveillance in Tourism Management

5. Application Examples
❖ Example: Biomedical Monitoring

5. Application Examples
❖ Example: Cardiac Pathology Detection

5. Application Examples
❖ Example: Biometry

5. Application Examples
❖ Example: Sonar Applications

5. Application Examples
❖ Example: Intelligent Transportation

5. Application Examples
❖ Example: Ground Penetrating Radar

References

❖ Farid Melgani, “Lecture Notes on Recognition Systems”, University of Trento.

2. Regression and Regularization

Contents
❖ Linear Regression with One Variable

❖ Linear Regression with Multiple Variables

❖ Logistic Regression

❖ Regularization
❖ Regularized Linear Regression

❖ Regularized Logistic Regression

2.1 Linear Regression with One Variable
❖ Housing price prediction (case of Bdr) ❖ Training set of housing prices:

Size in feet² (x)    Price ($) in 1000’s (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

[Figure: fitted line over the training set, predicting a price of about 280k for a 1500 ft² house]

❖ Supervised learning: given the ‘right answer’ for each example in the data,
predict a real-valued output.
❖ Such a problem is called a regression problem.
2.1 Linear Regression with One Variable
❖ Let
❖ m = number of training examples
❖ x’s = “input” variables / features
❖ y’s = “output” variable / “target” variable

❖ We can represent this with a block diagram as follows:

Training set → Learning Algorithm → h

size of house (x) → h → estimated price (estimated value of y)

❖ h is the hypothesis; it maps from x’s to y’s.


2.1 Linear Regression with One Variable
❖ How do we represent h?

[Figure: training points and the fitted line hθ(x) = θ0 + θ1·x]

❖ hθ(x), or simply h(x), given above is linear regression with one
variable, or univariate linear regression.

❖ Hypothesis: hθ(x) = θ0 + θ1·x

❖ Parameters: the θi’s

❖ How do we choose the θi’s?

2.1 Linear Regression with One Variable
❖ The idea is to choose θ0, θ1 so that hθ(x) is close to y for the training
examples (x, y).

❖ Fit the model by minimizing the sum of squared errors:

min over (θ0, θ1) of  (1/2m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) )²

where hθ(x^(i)) = θ0 + θ1·x^(i).

❖ Defining the cost

J(θ0, θ1) = (1/2m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) )²

the goal becomes: minimize J(θ0, θ1) over (θ0, θ1).

[Figure: training points (x^(i), y^(i)) and the fitted line]

❖ J(θ0, θ1) is called the cost function or squared-error function.

2.1 Linear Regression with One Variable
❖ Hence, for linear regression with one variable:

❖ Hypothesis: hθ(x) = θ0 + θ1·x

❖ Parameters: θ0, θ1

❖ Cost function: J(θ0, θ1) = (1/2m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) )²

❖ Goal: minimize J(θ0, θ1) over (θ0, θ1)
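
❖ A minimal sketch of this cost function in Python (the three training points are the toy data used in the worked examples below):

    import numpy as np

    def cost(theta0, theta1, x, y):
        """Squared-error cost J(theta0, theta1) for univariate linear regression."""
        m = len(x)
        predictions = theta0 + theta1 * x  # h_theta(x^(i)) for every i
        return np.sum((predictions - y) ** 2) / (2 * m)

    # Toy training set: points (1, 1), (2, 2), (3, 3)
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])
    print(cost(0.0, 1.0, x, y))  # 0.0, a perfect fit
    print(cost(0.0, 0.5, x, y))  # ~0.58, as in the worked example below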

2.1 Linear Regression with One Variable
❖ Let’s assume x ∈ R and set θ0 = 0, so θ = [0, θ1]. Then for the given data we
can plot hθ(x) and J(θ1):

[Figure: left, hθ(x) (for fixed θ1, a function of x); right, J(θ1) (a function of the parameter θ1)]
2.1 Linear Regression with One Variable
h(x) J(1)
(for fixed 1 , this is function of x (function of the parameter 1)

1 2 2
𝐽 0,1 = [ 0.5 − 1 + 1−2 + 1.5 − 3 2 ] ≈ 0.58
2𝑥3

2.1 Linear Regression with One Variable
h(x) J(1)
(for fixed 1 , this is function of x (function of the parameter 1)

1 2 2
𝐽 0,0 = [ 0 −1 + 0−2 + 0 − 3 2 ] ≈ 2.333
2𝑥3

2.1 Linear Regression with One Variable
❖ If we vary both θ0 and θ1 and plot the cost function against them, we get a bowl-shaped surface:

2.1 Linear Regression with One Variable

[Figures: successive choices of (θ0, θ1); left, hθ(x) for the fixed (θ0, θ1) as a function of x; right, the corresponding point on the contour plot of J(θ0, θ1)]
a. Basic Search Procedure
❖ Choose initial values for θ
❖ Until we reach a minimum:
❖ Choose a new value for θ to reduce J(θ)

[Figure: surface of J(θ) with a path descending toward the minimum]

❖ Since the least-squares objective function is convex (bowl-shaped), we
don’t need to worry about local minima
Gradient Descent
❖ Plotting and manually searching for the values of θ0 and θ1 that minimize the cost
function J(θ0, θ1) is very difficult.

❖ The most common algorithm for finding the values of θ0 and θ1 that
minimize J(θ0, θ1) automatically is the gradient descent algorithm.
❖ Steps
❖ Initialize θ
❖ Repeat until convergence
{
θj := θj − α · ∂J(θ)/∂θj    for j = 0 and j = 1
}
❖ α is the learning rate; it controls how big a step we take downhill with
gradient descent (e.g., α = 0.05).
Gradient Descent
❖ Correct simultaneous update:

temp0 := θ0 − α · ∂J(θ0, θ1)/∂θ0
temp1 := θ1 − α · ∂J(θ0, θ1)/∂θ1
θ0 := temp0
θ1 := temp1

❖ Let’s choose a θ1 on the graph and update it as follows.

[Figure: J(θ1) for θ1 ∈ R]

Gradient Descent
❖ The derivative of J(θ1) at θ1 is the slope of the tangent line that
passes through θ1.

❖ Here the slope of the tangent at θ1 is positive, so

θ1 := θ1 − α · ∂J(θ1)/∂θ1 = θ1 − α · (positive number)

and θ1 decreases, moving toward the minimum.

[Figure: J(θ1) with a tangent line of positive slope at the current θ1]

Gradient Descent
❖ Let’s reconsider our previous graph, picking a different θ1.

❖ Now the slope of the tangent through the current θ1 is negative: ∂J(θ1)/∂θ1 ≤ 0, so

θ1 := θ1 − α · (negative number)

and θ1 increases, again moving toward the minimum.

[Figure: J(θ1) with a tangent line of negative slope at the current θ1]

Gradient Descent
θ1 := θ1 − α · ∂J(θ1)/∂θ1

❖ If α is too small, gradient descent can be slow.

❖ If α is too large, gradient descent can overshoot the minimum. It may fail
to converge, or even diverge.

Gradient Descent
❖ Let’s say you initialize θ1 at a local optimum.

❖ At that point the derivative term is zero, so gradient descent leaves θ1
unchanged:
θ1 := θ1 − α · 0
θ1 := θ1
Gradient Descent
❖ Gradient descent can converge to a local minimum even with the learning rate α
held fixed.

❖ As we approach the minimum, the derivative term shrinks toward zero, so the
steps automatically become smaller; at the optimum itself θ1 is left unchanged.

Gradient Descent
❖ Now let’s apply the gradient descent algorithm to the linear
regression model, with

J(θ0, θ1) = (1/2m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) )²

❖ Then

∂J(θ0, θ1)/∂θj = ∂/∂θj [ (1/2m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) )² ]

❖ For hθ(x^(i)) = θ0 + θ1·x^(i):

∂J(θ0, θ1)/∂θj = ∂/∂θj [ (1/2m) Σ_{i=1..m} ( θ0 + θ1·x^(i) − y^(i) )² ]
Gradient Descent
❖ Then:

j = 0:  ∂J(θ0, θ1)/∂θ0 = (1/m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) )

j = 1:  ∂J(θ0, θ1)/∂θ1 = (1/m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) ) · x^(i)

❖ Thus the gradient descent becomes:

repeat until convergence {
θ0 := θ0 − α · (1/m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) )
θ1 := θ1 − α · (1/m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) ) · x^(i)
}

❖ Here do not forget to update θ0 and θ1 simultaneously; a sketch of the loop follows.
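
❖ A minimal Python sketch of this loop (the toy data and hyperparameters are illustrative):

    import numpy as np

    def gradient_descent(x, y, alpha=0.1, iters=1000):
        """Batch gradient descent for h(x) = theta0 + theta1 * x."""
        m = len(x)
        theta0, theta1 = 0.0, 0.0
        for _ in range(iters):
            err = theta0 + theta1 * x - y  # h_theta(x^(i)) - y^(i), all i
            # Simultaneous update: both gradients use the same err vector
            theta0 -= alpha * np.sum(err) / m
            theta1 -= alpha * np.sum(err * x) / m
        return theta0, theta1

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])
    print(gradient_descent(x, y))  # approaches (0.0, 1.0)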

Gradient Descent

[Figures: successive gradient-descent iterations on the housing data; left, hθ(x) for the current (θ0, θ1), starting from hθ(x) = −900 − 0.1·x; right, the corresponding point on the contour plot of J(θ0, θ1) moving toward the minimum]
Gradient Descent
❖ To see if gradient descent is working, print out J(θ) at each iteration.
❖ The value should decrease at each iteration.

❖ If it doesn’t, adjust α.
Linear Regression With Multiple Variables/Features
❖ Previously we discussed linear regression with one variable (the size of the
house) to predict the price of the house (y).

❖ But the size of the house is not the only feature for predicting the price: there are
also the number of bedrooms, the number of floors, and the age of the
home (in years).
Linear Regression With Multiple Variables/Features
❖ Let’s denote the features as x1, x2, x3 and x4, and the output as y, as shown
below.

❖ Having this, let’s introduce some more notation.

Linear Regression With Multiple Variables/Features
❖ From the above example we have:

❖ n = 4

❖ x^(1) = [1416, 3, 2, 40]; here we can see that x^(1) is an n-dimensional
vector.

❖ A specific feature value in the training set is denoted x_j^(i); in the
example above, x_3^(1) = 2.

❖ For the multiple-variable case, specifically the above example, the
hypothesis is given by:

hθ(x) = θ0 + θ1·x1 + θ2·x2 + θ3·x3 + θ4·x4
Linear Regression With Multiple Variables/Features
❖ Here remember that the hypothesis is trying to predict the price of the
house in thousands of dollars.

❖ For example, we can take the following as a sample hypothesis:

hθ(x) = 80 + 0.1·x1 + 0.01·x2 + 3·x3 − 2·x4

❖ We can generalize the expression of the hypothesis for n features as
follows:

hθ(x) = θ0 + θ1·x1 + θ2·x2 + … + θn·xn

❖ To simplify further, and for convenience of notation, let’s define
x0 = 1.

❖ Thus our feature vector x becomes an (n+1)-dimensional vector.

Linear Regression With Multiple Variables/Features

❖ And the hypothesis can be written as:

hθ(x) = θ0·x0 + θ1·x1 + θ2·x2 + … + θn·xn = θᵀx

❖ This is the form of the hypothesis when we have multiple features; it is
also called multivariate linear regression.
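
❖ In vectorized form this is a single dot product. A minimal sketch, reusing the sample hypothesis above (the house below is hypothetical):

    import numpy as np

    theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])  # [theta0, ..., theta4]
    x = np.array([1.0, 1416, 3, 2, 40])             # x0 = 1 prepended

    price = theta @ x  # h_theta(x) = theta^T x, price in $1000's
    print(price)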
Gradient Descent for Multiple Variables
❖ Here we will see how to fit the parameters to a given data set by applying
gradient descent.

❖ For linear regression with multiple features we have:

❖ Hypothesis: hθ(x) = θᵀx = θ0·x0 + θ1·x1 + θ2·x2 + … + θn·xn

❖ Parameters: θ0, θ1, θ2, …, θn

❖ Cost function:

J(θ0, θ1, θ2, …, θn) = (1/2m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) )²

Gradient Descent for Multiple Variables
❖ Gradient descent for multiple variables is:

Repeat {
θj := θj − α · ∂J(θ0, θ1, θ2, …, θn)/∂θj
} simultaneously update for every j = 0, …, n

❖ After taking the partial derivative we find the following algorithm (see the sketch below):

Repeat {
θj := θj − α · (1/m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) ) · x_j^(i)
(simultaneously update θj for j = 0, …, n)
}
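
❖ A vectorized sketch of this loop (assuming a design matrix X whose first column is the x0 = 1 feature; names are illustrative):

    import numpy as np

    def gradient_descent(X, y, alpha=0.01, iters=1000):
        """Batch gradient descent for multivariate linear regression."""
        m, n1 = X.shape  # m examples, n+1 features (incl. x0 = 1)
        theta = np.zeros(n1)
        for _ in range(iters):
            err = X @ theta - y  # h_theta(x^(i)) - y^(i), all i at once
            theta -= alpha * (X.T @ err) / m  # simultaneous update of all theta_j
        return theta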

Gradient Descent for Multiple Variables
❖ If we substitute, we get the following expression for each of the
parameters:

θ0 := θ0 − α · (1/m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) ) · x_0^(i)

θ1 := θ1 − α · (1/m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) ) · x_1^(i)

θ2 := θ2 − α · (1/m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) ) · x_2^(i)

…

θn := θn − α · (1/m) Σ_{i=1..m} ( hθ(x^(i)) − y^(i) ) · x_n^(i)

Gradient Descent for Multiple Variables
❖ To make sure that gradient descent is working correctly, we can plot J(θ) as
gradient descent runs, i.e., J(θ) versus the number of iterations.
❖ Gradient descent is working properly if J(θ) decreases after every
iteration.
❖ A typical plot of J(θ) vs. the number of iterations is given below.

Gradient Descent for Multiple Variables
❖ As we can see from the graph, the curve is flat from about 300 to 400 iterations,
meaning gradient descent has more or less converged by then.
❖ So by looking at the graph we can judge whether gradient descent has converged or not.
❖ There is also an automatic way of deciding whether gradient descent has converged:
❖ Declare convergence if J(θ) decreases by less than 0.001 in one iteration (see the
sketch below).
❖ In other cases, the plot of J(θ) may look like the graphs given below (increasing or
oscillating).
❖ In those situations, choosing a smaller learning rate can make gradient descent converge.
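
❖ A minimal sketch of that test (the threshold is the 0.001 from above; J_prev and J_curr would be cost values from successive iterations):

    def converged(J_prev, J_curr, eps=1e-3):
        """Declare convergence when J decreased by less than eps this iteration."""
        return 0.0 <= J_prev - J_curr < eps

    print(converged(2.3330, 2.3329))  # True: J barely moved, stop iterating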

Gradient Descent for Multiple Variables

❖ Note that for sufficiently small α, J(θ) should decrease on every iteration; but if α is
too small, gradient descent can be slow to converge.

Improving Learning: Feature Scaling
❖ The idea is to make sure that the features are on a similar scale.
❖ For example, suppose we have a problem with feature values:
x1 = size (0–2000 feet²)
x2 = number of bedrooms (0–5)
❖ In such a problem, if we plot the cost function we get tall, skinny
contours, and gradient descent takes a long time to converge to the minimum of
the cost function.
❖ When we face such a situation, we can scale the features as follows (see the sketch below):

x1 = size(feet²) / 2000   and   x2 = (number of bedrooms) / 5
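
❖ A minimal sketch of this scaling on a small feature matrix (the rows are hypothetical houses):

    import numpy as np

    # Columns: [size in feet^2, number of bedrooms]
    X = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)

    # Divide each feature by its range so every column lies roughly in (0, 1]
    X_scaled = X / np.array([2000.0, 5.0])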

Improving Learning: Feature Scaling
❖ After feature scaling, the cost function (contour plot) is much less skinny, as
shown below, and gradient descent converges much faster.

[Figure: contour plots of J over (θ1, θ2) before and after scaling]

❖ Feature scaling brought the features into the ranges 0 ≤ x1 ≤ 1 and 0 ≤ x2 ≤ 1,
which enables gradient descent to converge much faster.

Improving Learning: Feature Scaling
❖ This does not mean the features must be exactly in the range −1 ≤ xi ≤ 1; for
example, we can have:
❖ 0 ≤ x1 ≤ 3 (fine)
❖ −2 ≤ x2 ≤ 0.5 (fine)
❖ −100 ≤ x3 ≤ 100 (not a proper range, as 100 is far from 1)
❖ −0.0001 ≤ x4 ≤ 0.0001 (poorly scaled)
❖ Note that the features need not all have the same range of values, as long as they
are close to the −1 ≤ xi ≤ 1 range.

Normal Equation
❖ Here we will see the normal equation, which for some linear regression problems
gives us a much better way to solve for the optimum value of the parameter θ.
❖ Gradient descent needs a number of iterations to reach the optimum value, whereas
the normal equation, an analytical method, takes one step to get the optimum value.
❖ We know how to determine the minimum value from calculus, and the same
principle is applied here.

Normal Equation
❖ Example: m = 4

❖ To apply the normal equation, take the data set and add an extra
column for x0 (all ones); then we find:

Normal Equation

❖ Next, construct a matrix X which contains all the features, and a vector y from the outputs.

❖ X is an m × (n+1) matrix and y an m-dimensional column vector.

❖ Finally, take X transpose and multiply by X, invert the product, and multiply by
X transpose and then by y; solving θ = (XᵀX)⁻¹Xᵀy gives the values of θ that
minimize the cost function.

Normal Equation

❖ To generalize: for m training examples (x^(1), y^(1)), …, (x^(m), y^(m)) and n
features,

❖ the matrix X, called the design matrix, will be:
Normal Equation

❖ The design matrix X stacks the training examples as rows:

X = [ (x^(1))ᵀ ; (x^(2))ᵀ ; (x^(3))ᵀ ; … ; (x^(m))ᵀ ]

❖ And thus, after setting up X and y, we can evaluate θ = (XᵀX)⁻¹Xᵀy.

❖ The matrix inverse and transpose can be computed in MATLAB/Octave; a sketch
below shows the same computation in Python.

❖ We used feature scaling for the gradient descent method, but it is not necessary with
the normal equation method.
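
❖ A minimal NumPy sketch of the normal equation (the four training examples reuse the housing rows from earlier; pinv is used for numerical robustness):

    import numpy as np

    # Design matrix: x0 = 1 column plus one feature (size); y: prices
    X = np.array([[1.0, 2104], [1.0, 1416], [1.0, 1534], [1.0, 852]])
    y = np.array([460.0, 232.0, 315.0, 178.0])

    # theta = (X^T X)^(-1) X^T y, the minimizer of the squared-error cost
    theta = np.linalg.pinv(X.T @ X) @ X.T @ y
    print(theta)  # [theta0, theta1]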
Normal Equation

❖ Let’s compare the advantages and disadvantages of the gradient descent and
normal equation methods.

❖ The normal equation method is feasible for n up to a few thousand features; for
larger n, it is better to use gradient descent.

Normal Equation
❖ Normal Equation and Non-invertibility

❖ There are two conditions which cause non-invertibility of XᵀX: redundant
(linearly dependent) features, and too many features (m ≤ n).

Linear Regression With Multiple Variables/Features

❖ Quiz
1. Suppose you are working on weather prediction, and your
weather station makes one of three predictions for each day's
weather: Sunny, Cloudy or Rainy. You'd like to use a learning
algorithm to predict tomorrow's weather.
Would you treat this as a classification or a regression
problem?
a. Regression
b. Classification

Linear Regression With Multiple Variables/Features

❖ Quiz
2. Suppose you have 14 training examples with 3 features.
The normal equation is θ=(XTX)−1XTy. For the given values
of m and n, what are the dimensions of θ, X, and y in this
equation?
a. X is 14×4, y is 14×1, θ is 4×1
b. X is 14×3, y is 14×1, θ is 3×3
c. X is 14×3, y is 14×1, θ is 3×1
d. X is 14×4, y is 14×4, θ is 4×4

Linear Regression With Multiple Variables/Features

❖ Quiz
3. You run gradient descent for 15 iterations with α=0.3 and
compute J(θ) after each iteration. You find that the value of
J(θ) increases over time. Based on this, which of the
following conclusions seems most plausible?
a. Rather than use the current value of α, it'd be more
promising to try a larger value of α (say α=1.0).
b. α=0.3 is an effective choice of learning rate.
c. Rather than use the current value of α, it'd be more
promising to try a smaller value of α (say α=0.1).

