Machine Learning Shortnote
Supervised Learning
Algorithm learns to map a given input to a desired output, so it can map an unforeseen input to an output.
Examples:
• Spam filters
• Handwritten character recognition
• Credit limit
A single data item with multiple attributes may be called:
• Samples
• Data Points
• Data Vectors
• Data Items
Supervised Learning Problems
• Regression Problems
- Real Valued Output
• Classification Problems
- Categorical / Discrete Output
Classification or Regression ?
• Predicting the salary of a fresh graduate
• Email – Spam or Not
• Tumor – Cancer or Not
• Handwritten character recognition
• Predicting weather – Sunny, Rainy, Cloudy, Windy
Unsupervised Learning
Algorithm learns without having a desired output for a given input. It attempts to find structure within the data set (identify similarities and divisions among the data).
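As an illustration of finding structure without labels, here is a minimal 1-D k-means sketch (this algorithm and the data are assumptions for illustration, not part of the notes):

```python
# Minimal 1-D k-means sketch: group points into clusters with no labels,
# by alternating nearest-center assignment and center updates.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]      # two obvious groups
print(kmeans_1d(data, centers=[0.0, 5.0]))  # centers converge near 1.0 and 9.5
```

The algorithm discovers the two groups purely from similarities in the data, which is the essence of unsupervised learning described above.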
Lecture 03
Regression
• A statistical method
• Attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of independent variables (usually denoted by x).
Notation
• 𝑥 – input (multivariate data point / vector)
• 𝑑 – number of features (dimensions)
• 𝑥𝑖 – the 𝑖-th data point
• 𝑥𝑖(𝑘) – the 𝑘-th feature of the 𝑖-th data point
Convergence
• Plot the cost function and check that the error has been minimized and has stabilized
• Alternatively, stop when the change in cost falls below a predefined tolerance level
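The tolerance-based stopping rule above can be sketched as follows (the toy cost J(θ) = (θ − 3)², the learning rate, and the tolerance are assumptions for illustration):

```python
# Gradient descent that stops when the change in the cost falls below a
# predefined tolerance, i.e. the error has minimized and stabilized.

def cost(theta):
    return (theta - 3.0) ** 2   # toy cost with its minimum at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)  # derivative of the toy cost

theta, alpha, tol = 0.0, 0.1, 1e-9
history = [cost(theta)]
for _ in range(10_000):
    theta -= alpha * grad(theta)
    history.append(cost(theta))
    # Converged: the cost is no longer changing by more than the tolerance.
    if abs(history[-2] - history[-1]) < tol:
        break

print(round(theta, 4))  # close to the minimizer 3.0
```

In practice one would also plot `history` against the iteration number, which is the visual convergence check the notes mention.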
Normal Equations
• Gradient Descent – iterative process.
• Normal Equations – solve a mathematical equation to find the optimal values of the parameters (𝜃𝑘).
• Directly finds the value of 𝜃 without the iterative process of Gradient Descent.
• Effective when the data set has few features.
• Take the partial derivative with respect to each 𝜃𝑘 and set it to zero.
• Solve the equations to find each parameter 𝜃𝑘.
Normal Equation Method
• Construct the matrix 𝑋 from the sample
data
• Construct the vector 𝑦 from the available
labels
• Compute the pseudo-inverse of 𝑋
• Compute 𝜃 by 𝜃 = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑦
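The steps above can be sketched with NumPy (the toy data, generated from y = 1 + 2x, is an assumption for illustration):

```python
# Normal equation method: build X and y, then compute
# theta = (X^T X)^(-1) X^T y via the pseudo-inverse of X.
import numpy as np

X = np.array([[1.0, 0.0],   # first column of ones absorbs the intercept
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# np.linalg.pinv computes the pseudo-inverse, equivalent here to
# inv(X.T @ X) @ X.T but numerically safer.
theta = np.linalg.pinv(X) @ y
print(theta)  # approximately [1. 2.]
```

No learning rate and no iterations are needed, which is the contrast with gradient descent the notes draw.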
Logistic Regression
Linear Regression for Classification
• In our linear regression model, the hypothesis is ℎ(𝑥) = 𝜃ᵀ𝑥
• It gives real valued output for any given
input.
• Linear regression doesn’t suit classification
problems.
Classification
• What we want is a hypothesis that restricts
the output. 0 ≤ ℎ𝜃(𝑥) ≤ 1
• 0 − 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 𝐶𝑙𝑎𝑠𝑠
• 1 − 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝐶𝑙𝑎𝑠𝑠
Logistic Regression
• Uses the linear regression hypothesis ℎ(𝑥)
• Applies a restrictive function 𝑔 on ℎ(𝑥)
• The result is the classification hypothesis 𝑘(𝑥):
𝑘(𝑥) = 𝑔(ℎ(𝑥))
• We use 𝑔 = the Logistic Function
Logistic Function
• Always provides output 0 ≤ 𝑠𝜃(𝑥) ≤ 1
• If 𝑠𝜃(𝑥) = 0.82
  • Can use a threshold of 0.5 and declare it class 1
  • Can state the probability of being class 1 is 82%
  • Probability of being class 0 is 18%
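The logistic function and the thresholding rule above can be sketched as (the 0.82 example value is from the notes; the function names are illustrative):

```python
# The logistic (sigmoid) function squashes any real value into (0, 1),
# so its output can be read as a class-1 probability and thresholded.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(prob, threshold=0.5):
    return 1 if prob >= threshold else 0

p = 0.82                     # suppose s_theta(x) = 0.82
print(classify(p))           # class 1 under the 0.5 threshold
print(f"P(class 1) = {p:.0%}, P(class 0) = {1 - p:.0%}")
```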
Lecture 04
Regularization
Underfitting/ Overfitting
• Underfitting – High Bias
• Just Fit – Quadratic function might just fit
• Overfitting – High Variance
Bias – Variance Tradeoff
• Approximation vs Generalization
• Bias : the difference between the average prediction of your hypothesis and the correct value which you try to predict. High bias leads to underfitting.
• Variance : how much your hypothesis changes from the target function if the training data set is changed. High variance leads to overfitting.
Avoiding Overfitting
• Reduce number of features
  • Manual feature reduction
  • Model selection algorithm
    • Algorithms to select features and get rid of some features
• Regularization
  • No feature reduction
Regularization
• Keep all features
• Reduce the magnitude of parameters/ weights
• It has been shown that parameters with reduced magnitude make a function smoother
• Works well with high-dimensional data
• Regularization helps when,
  • Each dimension contributes to the final output
  • We are not sure of the influence of each dimension on the final output
• The smaller the parameter values,
  • the simpler the hypothesis
  • the less prone it is to overfitting
• With high-dimensional data it is difficult to identify which parameters should be penalized.
• Thus, with regularization we attempt to shrink all the parameters.
𝜆 in the Equation
• If 𝜆 is very large
  • Penalizes all the 𝜃 parameters and brings them near zero
  • Hypothesis becomes flat
  • Underfit – High bias
• If 𝜆 is very small
  • Almost no influence over the 𝜃 parameters
  • Hypothesis tends to overfit
  • High Variance
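The effect of 𝜆 can be sketched with ridge regression, whose closed form 𝜃 = (𝑋ᵀ𝑋 + 𝜆𝐼)⁻¹𝑋ᵀ𝑦 shrinks all parameters together (the data below is made up for illustration; the notes do not specify this particular regularizer):

```python
# Ridge regression sketch: a large lambda pulls every theta toward zero
# (flatter hypothesis, higher bias); a small lambda barely changes the fit.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=20)

def ridge(X, y, lam):
    d = X.shape[1]
    # Closed form: theta = (X^T X + lam * I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

small = ridge(X, y, lam=0.01)   # close to the unregularized fit
large = ridge(X, y, lam=1000.0) # all parameters shrunk toward zero
print(np.linalg.norm(small) > np.linalg.norm(large))  # True
```

Note that every component of 𝜃 is penalized equally, matching the point above that we shrink all parameters rather than guessing which one to penalize.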
ANN - Artificial Neural Networks
Why ?
• Difficulty of modelling complex hypotheses with linear models.
• Polynomial curve fitting is difficult with a large number of features.
Activation Function: Thresholding
• Thresholding makes a harsh decision
• A smoother transition is preferred
• Other activation functions:
  • Sigmoid functions
  • ReLU
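The contrast between the harsh threshold and the smoother activations named above can be sketched as (the sample inputs are illustrative):

```python
# Compare a hard threshold with the smoother sigmoid and ReLU activations.
import math

def step(z):                 # hard thresholding: an abrupt 0/1 decision
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):              # smooth transition between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):                 # zero for negatives, identity for positives
    return max(0.0, z)

for z in (-2.0, -0.1, 0.1, 2.0):
    print(z, step(z), round(sigmoid(z), 3), relu(z))
```

Near z = 0 the step function jumps from 0 to 1, while the sigmoid changes gradually, which is why the smoother transition is preferred.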
Note
• Activations of layer N will be the input vector for layer N+1.
• Activations of layer N+1 are a function of the activations of layer N.
• Hence, activations of layer N+1 are a function of the original input vector.
• (N = 1 is a special case: the activations of layer 1 are simply the input vector itself.)
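The layer-by-layer note above can be sketched as a small forward pass (the weights, layer sizes, and sigmoid activation are arbitrary assumptions for illustration):

```python
# Forward pass: activations of layer N become the input vector of layer N+1,
# so the final activations are a function of the original input.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(activations, weights):
    # One dense layer: weighted sums of the previous layer's activations.
    return [sigmoid(sum(w * a for w, a in zip(row, activations)))
            for row in weights]

x = [1.0, 0.5]                       # layer 1 "activations" = the input itself
W2 = [[0.4, -0.6], [0.3, 0.8]]       # weights into layer 2
W3 = [[1.0, -1.0]]                   # weights into layer 3 (the output)

a2 = layer(x, W2)                    # layer 2 activations from layer 1
a3 = layer(a2, W3)                   # layer 3 activations from layer 2
print(a3)                            # a function of the original input x
```

Composing `layer` twice makes a3 depend on x only through a2, exactly the chaining the note describes.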
Lecture 05