ANN-Regression-Python Examples
Halûk Gümüşkaya
Professor of Computer Engineering
web: http://www.gumuskaya.com
e-mail: [email protected], [email protected]
LinkedIn: https://tr.linkedin.com/in/halukgumuskaya
Facebook: https://www.facebook.com/2haluk.gumuskaya
Generate Data
Let’s generate some linear-looking data to test this equation.
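A minimal sketch of how such data can be generated with NumPy (the intercept 4, slope 3, and sample size 100 are illustrative choices, not values taken from the slide):

import numpy as np

np.random.seed(42)                      # for reproducible results
m = 100                                 # number of instances (illustrative)
X = 2 * np.random.rand(m, 1)            # single feature x1, uniform in [0, 2)
y = 4 + 3 * X + np.random.randn(m, 1)   # y = 4 + 3*x1 + Gaussian noise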
[Plot: the randomly generated linear dataset, y versus the feature x1]
The Linear Regression model prediction: hθ(x) = θᵀ x
You need to calculate how much the cost function will change if you
change θj just a little bit. This is called a partial derivative.
Partial derivatives of the cost function:
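For the MSE cost function used with Linear Regression, these take the standard form (m is the number of training instances):

∂MSE(θ)/∂θj = (2/m) Σi=1..m (θᵀ x(i) − y(i)) xj(i)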
Notice how irregular the steps are.
Note that since instances are picked randomly, some instances may be
picked several times per epoch, while others may not be picked at all.
If you want to be sure that the algorithm goes through every instance at
each epoch, another approach is to shuffle the training set (making sure to
shuffle the input features and the labels jointly), then go through it instance
by instance, then shuffle it again, and so on.
However, this approach generally converges more slowly.
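A minimal sketch of this shuffle-each-epoch variant (the learning rate, epoch count, and the data X, y are illustrative assumptions):

import numpy as np

def sgd_shuffled(X, y, eta=0.01, n_epochs=50):
    m, n = X.shape
    X_b = np.c_[np.ones((m, 1)), X]            # add bias term x0 = 1
    theta = np.random.randn(n + 1, 1)          # random initialization
    for epoch in range(n_epochs):
        perm = np.random.permutation(m)        # shuffle features and labels jointly
        X_shuffled, y_shuffled = X_b[perm], y[perm]
        for i in range(m):                     # go through every instance exactly once
            xi = X_shuffled[i:i+1]
            yi = y_shuffled[i:i+1]
            gradients = 2 * xi.T @ (xi @ theta - yi)
            theta = theta - eta * gradients
    return theta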
Once again, you find a solution quite close to the one returned
by the Normal Equation:
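In Scikit-Learn, Stochastic GD for Linear Regression can be run with the SGDRegressor class; a minimal sketch (the hyperparameter values are illustrative, and X, y are the generated data from earlier):

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())                  # ravel() because fit expects a 1D target
print(sgd_reg.intercept_, sgd_reg.coef_)   # should be close to the Normal Equation's θ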
They all end up near the minimum, but Batch GD’s path actually
stops at the minimum, while both Stochastic
GD and Mini-batch GD continue to walk around.
However, don’t forget that Batch GD takes a lot of time to take each
step, and Stochastic GD and Mini-batch GD would also reach the
minimum if you used a good learning schedule.
Polynomial Regression
What if your data is more complex than a straight line?
Surprisingly, you can use a linear model to fit nonlinear data.
A simple way to do this is to add powers of each feature as
new features, then train a linear model on this extended set of
features.
This technique is called Polynomial Regression.
PolynomialFeatures Class
Use Scikit-Learn’s PolynomialFeatures class to transform our
training data, adding the square (second-degree polynomial) of each
feature in the training set as a new feature (in this case there is just
one feature):
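A minimal sketch of that transformation (the quadratic data X, y here is illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# illustrative nonlinear data: y = 0.5*x1^2 + x1 + 2 + noise
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)    # each row is now [x1, x1^2]

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)                     # a linear model on the extended feature set
print(lin_reg.intercept_, lin_reg.coef_)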
[Plot: a high-degree polynomial model severely overfitting the training data, while a linear model is underfitting it]
Cross-Validation
You used cross-validation to get an estimate of a model’s
generalization performance.
If a model performs well on the training data but generalizes
poorly according to the cross-validation metrics, then your
model is overfitting.
If it performs poorly on both, then it is underfitting.
This is one way to tell when a model is too simple or too
complex.
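A hedged sketch of such a check with Scikit-Learn's cross_val_score (the model choice and the data X_poly, y from the earlier example are assumptions):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

model = LinearRegression()                       # any estimator could go here
scores = cross_val_score(model, X_poly, y.ravel(),
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = (-scores) ** 0.5                   # back to RMSE for readability
print(rmse_scores.mean(), rmse_scores.std())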
Learning Curves
These are plots of the model’s performance on the training set
and the validation set as a function of the training set size (or
the training iteration).
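A minimal sketch of how such curves can be produced (the 80/20 split and the RMSE metric are the usual choices, not slide specifics):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])        # train on the first m instances
        y_train_pred = model.predict(X_train[:m])
        y_val_pred = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_pred))
        val_errors.append(mean_squared_error(y_val, y_val_pred))
    plt.plot(np.sqrt(train_errors), "r-+", label="train")
    plt.plot(np.sqrt(val_errors), "b-", label="validation")
    plt.xlabel("Training set size"); plt.ylabel("RMSE"); plt.legend()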
Regularization
A good way to reduce overfitting is to regularize the model
(i.e., to constrain it): the fewer degrees of freedom it has, the
harder it will be for it to overfit the data.
A simple way to regularize a polynomial model is to reduce the
number of polynomial degrees.
For a linear model, regularization is typically achieved by
constraining the weights of the model.
We will now look at Ridge Regression, Lasso Regression, and
Elastic Net, which implement 3 different ways to constrain the
weights.
Constraining the weights (by adding a regularization term to the cost function) forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.
A linear model (left) and a polynomial model (right), both with various
levels of Ridge regularization
Ridge Regression
Ridge Regression closed-form solution: θ̂ = (Xᵀ X + α A)⁻¹ Xᵀ y, where A is the (n+1) x (n+1) identity matrix, except with a 0 in the top-left cell (corresponding to the bias term θ0).
Here is how to perform Ridge Regression with Scikit-Learn using a
closed-form solution (a variant of the above equation that uses a
matrix factorization technique by André-Louis Cholesky):
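A minimal sketch (the α value is illustrative; X, y are the single-feature data generated earlier):

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1, solver="cholesky")   # closed-form solution via Cholesky
ridge_reg.fit(X, y)
print(ridge_reg.predict([[1.5]]))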
Lasso Regression
Another regularized version of Linear Regression.
Just like Ridge Regression, it adds a regularization term to the cost
function, but it uses the ℓ1 norm of the weight vector instead of half
the square of the ℓ2 norm.
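Written out (with n features and regularization strength α), the two cost functions take their standard forms:

Ridge: J(θ) = MSE(θ) + (α/2) Σi=1..n θi²
Lasso: J(θ) = MSE(θ) + α Σi=1..n |θi|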
The figure below shows the same thing as before, but with Lasso models instead of Ridge models and smaller α values.
A linear model (left) and a polynomial model (right), both using various levels of
Lasso regularization
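A minimal Scikit-Learn sketch (the α value is illustrative; X, y are the single-feature data generated earlier):

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print(lasso_reg.predict([[1.5]]))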
Important Characteristic of Lasso Regression
It tends to eliminate the weights of the least important features
(i.e., set them to zero).
For example, the dashed line in the righthand plot in the figure (with a very small α) looks quadratic, almost linear: all the weights for the high-degree polynomial features are equal to zero.
In other words, Lasso Regression automatically performs
feature selection and outputs a sparse model (i.e., with few
nonzero feature weights).
Lasso Regression
The Lasso cost function is not differentiable at θi = 0, but Gradient Descent still works if you use a subgradient vector instead whenever any θi = 0.
Elastic Net
Elastic Net is a middle ground between Ridge Regression and Lasso Regression: its regularization term is a mix of both, controlled by a mix ratio r. Here is a short example with Scikit-Learn's ElasticNet class (l1_ratio corresponds to the mix ratio r):
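A minimal sketch (the α and r values are illustrative; X, y as before):

from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)   # r = 0.5: half Ridge, half Lasso
elastic_net.fit(X, y)
print(elastic_net.predict([[1.5]]))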
Early Stopping
A very different way to regularize iterative learning algorithms
such as Gradient Descent is to stop training as soon as the
validation error reaches a minimum.
This is called early stopping.
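A hedged sketch of early stopping with SGDRegressor trained one epoch at a time (the prepared training/validation split X_train, y_train, X_val, y_val, the epoch count, and the learning rate are assumptions):

from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

best_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train.ravel())     # warm_start: continues where it left off
    y_val_pred = sgd_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_val_pred)
    if val_error < best_val_error:            # keep the model with the lowest
        best_val_error = val_error            # validation error seen so far
        best_model = deepcopy(sgd_reg)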
Logistic Function
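The logistic (sigmoid) function is the standard one, and the model's estimated probability is obtained by applying it to the linear combination of the inputs:

σ(t) = 1 / (1 + e^(−t))
p̂ = hθ(x) = σ(θᵀ x)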
The cost function over the whole training set is the average cost over all training instances. It can be written in a single expression called the log loss, shown below:
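With m training instances, labels y(i) ∈ {0, 1}, and estimated probabilities p̂(i), the log loss takes its usual form:

J(θ) = −(1/m) Σi=1..m [ y(i) log(p̂(i)) + (1 − y(i)) log(1 − p̂(i)) ]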
• Once you have the gradient vector containing all the partial derivatives,
you can use it in the Batch Gradient Descent algorithm. That’s it: You
now know how to train a Logistic Regression model.
• For Stochastic GD you would take one instance at a time, and
• For Mini-batch GD you would use a minibatch at a time.
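The partial derivatives mentioned in the first bullet take the standard form for the log loss:

∂J(θ)/∂θj = (1/m) Σi=1..m ( σ(θᵀ x(i)) − y(i) ) xj(i)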
Fancier Plot
Estimated probabilities and decision boundary
• The petal width of Iris virginica flowers (represented by triangles) ranges from 1.4 cm
to 2.5 cm, while the other iris flowers (represented by squares) generally have a
smaller petal width, ranging from 0.1 cm to 1.8 cm.
• Notice that there is a bit of overlap.
• Above about 2 cm the classifier is highly confident that the flower is an Iris virginica (it
outputs a high probability for that class), while below 1 cm it is highly confident that it
is not an Iris virginica (high probability for the “Not Iris virginica” class).
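A hedged sketch of the kind of code behind such a plot, using Scikit-Learn's iris dataset and petal width as the single feature (variable names and plotting details are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]                      # petal width (cm), single feature
y = (iris["target"] == 2).astype(int)        # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris virginica")
plt.xlabel("Petal width (cm)"); plt.ylabel("Probability"); plt.legend()
plt.show()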
Linear Decision Boundary
The same dataset, displaying 2 features: petal width and length.
Once trained, the Logistic Regression classifier can, based on these
2 features, estimate the probability that a new flower is an Iris
virginica.
The dashed line represents the points where the model estimates a
50% probability: this is the model’s decision boundary.
Note that it is a linear boundary.***
*** It is the set of points x such that θ0 + θ1x1 + θ2x2 = 0, which defines a straight line.
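A hedged sketch of fitting on the two features and computing a point on that boundary (the C value and variable names are illustrative):

from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]               # petal length, petal width
y = (iris["target"] == 2).astype(int)     # 1 if Iris virginica, else 0

log_reg = LogisticRegression(C=10**10)    # very weak regularization (illustrative)
log_reg.fit(X, y)

# Boundary: theta0 + theta1*x1 + theta2*x2 = 0  =>  x2 = -(theta0 + theta1*x1) / theta2
theta0 = log_reg.intercept_[0]
theta1, theta2 = log_reg.coef_[0]
x1 = 5.0                                  # an example petal length (cm)
x2_boundary = -(theta0 + theta1 * x1) / theta2
print(x2_boundary)                        # petal width on the 50% boundary at that length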
Other References:
Chris I., "Fit a Linear Regression Model with Gradient Descent from Scratch," Towards Data Science, Dec 2019, https://towardsdatascience.com/fit-a-linear-regression-model-with-gradient-descent-from-scratch-d9bb41bc821e