
Machine Learning

2EL1730

Lecture 7
Neural Networks

Fragkiskos Malliaros and Maria Vakalopoulou

Friday, January 09, 2020


Acknowledgements

• The lecture is partially based on material by


– Richard Zemel, Raquel Urtasun and Sanja Fidler (University of Toronto)
– Chloé-Agathe Azencott (Mines ParisTech)
– Julian McAuley (UC San Diego)
– Dimitris Papailiopoulos (UW-Madison)
– Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford Univ.)
• http://www.mmds.org
– Panagiotis Tsaparas (UOI)
– Evimaria Terzi (Boston University)
– Andrew Ng (Stanford University)
– Andrej Karpathy (Stanford University)
– Nina Balcan and Matt Gormley (CMU)
– Ricardo Gutierrez-Osuna (Texas A&M Univ.)
– Tan, Steinbach, Kumar
• Introduction to Data Mining

Thank you!

2
Last lectures

3
Linear (Least-Squares) Regression

• Learning: finds the parameters that minimize some objective function
• We minimize the sum of the squares (the objective is reconstructed after the slide)
• (Stochastic) gradient descent
• Or, closed-form solution

4
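The formulas on this slide were images and did not survive the text export; a standard reconstruction of the least-squares quantities (my notation, which may differ from the slide's) is, in LaTeX:

J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{m} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right)^2

Gradient-descent step with learning rate \eta (the stochastic version uses a single randomly drawn example per step):

\mathbf{w} \leftarrow \mathbf{w} + \eta \sum_{i=1}^{m} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right) \mathbf{x}_i

Closed-form (normal equations) solution:

\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}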
Logistic Regression

• How to turn a real-valued expression into a probability
• Replace the sign() with the sigmoid or logistic function

[Plot: the sigmoid function σ(z)]

5
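The sigmoid itself was shown as an image; its standard definition (assuming the usual notation for the linear score z) is:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \mathbf{w}^\top \mathbf{x} + b, \qquad P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b)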
k-Nearest Neighbors (kNN) Algorithm

1NN: every example in the blue shaded area will be misclassified as the blue class
3NN: every example in the blue shaded area will be classified correctly as the red class

• The 1NN algorithm is sensitive to mis-labeled data (‘class noise’)
• Consider the vote of the k nearest neighbors (majority vote)

Algorithm kNN (a code sketch follows the slide)
• Find the k examples (x*_i, y*_i), i = 1, …, k, closest to the test instance x
• The output is the majority class
6
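A minimal NumPy sketch of that procedure (my own illustration; the name knn_predict and the toy data are not from the lecture):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the test instance x to every training example
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # -> 1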
Choice of Parameter k

• Small k: noisy decision


– The idea behind using more than one neighbor is to average out the noise
• Large k
– May lead to better prediction performance
– If we set k too large, we may end up looking at samples that are not
neighbors (are far away from the point of interest)
– Also, computationally intensive
– Extreme case: set k=m (number of points in the dataset)
• For classification: the majority class
• For regression: the average value

7
Decision Tree Induction – The Idea

• Basic algorithm
– Tree is constructed in a top-down recursive manner
– Initially, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized
in advance)
– Examples are partitioned recursively based on the selected
attributes
– Split attributes are selected on the basis of a heuristic or statistical
measure (e.g., gini index, information gain)

• Most commercial DTs use variations of this algorithm

8
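For reference, a minimal scikit-learn sketch of this induction procedure (the toy data is mine; gini vs. information gain is selected via the criterion parameter):

from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # toy categorical-style features
y = [0, 0, 0, 1]
# criterion="gini" uses the Gini index; criterion="entropy" uses information gain
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(clf.predict([[1, 1]]))  # -> [1]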
Ensemble Methods

• Typical application: classification


• Ensemble of classifiers: set of classifiers whose individual decisions are
combined in some way to classify new examples
• Simplest approach:
1. Generate multiple classifiers (e.g., decision trees, logistic
regression)
2. Each classifier votes (decides) on a test instance
3. Take majority as classification
• Classifiers are different due to different sampling of training data, or
randomized parameters within the classification algorithm

9
Ensemble Methods: Summary

• Differ in training strategy and combination method

• Bagging (bootstrap aggregation)


– Random sampling with replacement
– Train separate models on overlapping training sets, average their
predictions
– E.g., random forest classifier

• Boosting
– Sequential training, iteratively re-weighting the training examples –
the current classifier focuses on hard examples
– E.g., Adaboost

10
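A short scikit-learn illustration of both strategies (the synthetic dataset and parameter values are my own choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Bagging-style: random forest (bootstrap samples + randomized splits)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting: AdaBoost (sequentially re-weights the hard examples)
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

print(rf.score(X, y), ada.score(X, y))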
Support Vector Machines (SVM)

https://math.stackexchange.com/questions/1305925/why-does-the-svm-margin-is-frac2-mathbfw
11
Support Vector Machines (SVM)

o Linearly separable case: hard-margin SVM


o Non-separable, but still linear: soft-margin SVM
o Non-linear: kernel SVM
o Kernels for
o Real-valued data
o Strings
o Graphs

12
In this Lecture

• Neural Networks
• Perceptron
• Multilayer perceptron
• Stochastic gradient descent
• Backpropagation

13
Neural Networks

14
Applications

• Importance of Neural Networks

15
Applications

• Importance of Neural Networks

16
Applications

• Importance of Neural Networks

17
Perceptron – The Idea

• Biology inspired

18
Perceptron

• Perceptron [Rosenblatt 1962]: A linear discriminant model for binary classification

[Diagram: a perceptron with inputs, weights, bias, and output]

How can we do classification?

19
Perceptron

• Perceptron [Rosenblatt 1962]: A linear discriminant model for binary classification

[Diagram: the perceptron output is produced by a threshold function applied to the weighted inputs]

The decision boundary is a hyperplane

20
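Written out (a standard reconstruction, since the slide's equations were images):

f(\mathbf{x}) = \mathrm{step}\left( \mathbf{w}^\top \mathbf{x} + b \right) = \begin{cases} 1 & \text{if } \mathbf{w}^\top \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}

so the decision boundary is the hyperplane \mathbf{w}^\top \mathbf{x} + b = 0.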
Perceptron

• Perceptron [Rosenblatt 1962]: A linear discriminant model for binary classification

21
Perceptron

• Perceptron [Rosenblatt 1962]: A linear discriminant model for binary classification

Activation functions:
1) step function
2) sigmoid function
3) tanh
4) softmax

22
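Their standard definitions, in LaTeX (my reconstruction; the slide showed them as images):

\mathrm{step}(z) = \mathbb{1}[z > 0], \quad \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \quad \mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_j e^{z_j}}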
Perceptron

• Perceptron [Rosenblatt 1962]: A linear discriminant model for binary classification

Loss functions:
1) Cross entropy
2) Mean square error

23
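For a single example with label y ∈ {0, 1} and prediction ŷ, the usual forms are (again my reconstruction, since the slide's formulas were images):

\mathcal{L}_{\mathrm{CE}}(y, \hat{y}) = -\, y \log \hat{y} - (1 - y) \log(1 - \hat{y}), \qquad \mathcal{L}_{\mathrm{MSE}}(y, \hat{y}) = \tfrac{1}{2} (y - \hat{y})^2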
Perceptron

• Example of Perceptron: Learn the operation AND

x1 x2 y
0 0 0
0 1 0
1 0 0
1 1 1

24
Perceptron

• Example of Perceptron: Learn the operation AND

x1 x2 y
0 0 0
0 1 0
1 0 0
1 1 1

x1 x2 f(x)
0 0 s(-1.5 + 0 + 0) = s(-1.5) = 0
0 1 s(-1.5 + 0 + 1) = s(-0.5) = 0
1 0 s(-1.5 + 1 + 0) = s(-0.5) = 0
1 1 s(-1.5 + 1 + 1) = s(0.5) = 1

25
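Reading the parameters off the computation above (bias -1.5, both weights 1, s = step function), a runnable check:

import numpy as np

def step(z):
    return 1 if z > 0 else 0

w, b = np.array([1.0, 1.0]), -1.5   # weights and bias taken from the slide's computation

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, step(w @ np.array([x1, x2]) + b))
# prints 0, 0, 0, 1 (the AND truth table)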
Training a Perceptron

• Rosenblatt's innovation was mainly the learning algorithm for perceptrons
• Learning algorithm (a code sketch follows the slide)
• Initialize weights randomly
• Take one sample x_i and predict y_i
• For erroneous predictions update the weights
• If prediction y' = 0 and ground truth y_i = 1, increase the weights
• If prediction y' = 1 and ground truth y_i = 0, decrease the weights
• Repeat until no errors are made

26
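A minimal NumPy sketch of that loop, assuming the standard perceptron update w ← w + η (y − y') x (the slide itself only says "increase" / "decrease"):

import numpy as np

def train_perceptron(X, y, lr=0.1, max_epochs=1000):
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])       # initialize weights randomly
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            y_pred = 1 if (w @ x_i + b) > 0 else 0
            if y_pred != y_i:             # erroneous prediction: update weights
                w += lr * (y_i - y_pred) * x_i   # increases them if y_i = 1, decreases if y_i = 0
                b += lr * (y_i - y_pred)
                errors += 1
        if errors == 0:                   # repeat until no errors are made
            break
    return w, b

# Learning AND, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
w, b = train_perceptron(X, np.array([0, 0, 0, 1]))
print([(1 if w @ x + b > 0 else 0) for x in X])   # -> [0, 0, 0, 1]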
From a single layer to multiple layers

• 1 perceptron == 1 decision
• What about multiple decisions?
• E.g. digit classification
• Stack as many outputs as there are possible outcomes into a layer
• Neural Network

27
Perceptron

• Multiclass Classification

Choose C_k if its score is the largest among the classes (the exact condition is reconstructed below)
To get probabilities we use the softmax:
If the output for one class is sufficiently larger
than the others', its softmax will be close to
1 (and close to 0 otherwise)

28
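A standard reconstruction of the two missing formulas, with \mathbf{w}_k the weight vector of class C_k:

\text{Choose } C_k \text{ if } \mathbf{w}_k^\top \mathbf{x} \ge \mathbf{w}_j^\top \mathbf{x} \;\; \forall j \neq k, \qquad \mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_j e^{z_j}}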
What is a potential problem with perceptrons?

• They can only return one output, so only work for binary
problems
• They are linear machines, so can only solve linear problems
• They can work for vector inputs
• They are too complex to train, so they can work with big
computers only

29
What is a potential problem with perceptrons?

• (Answer) They are linear machines, so they can only solve linearly separable
problems; the XOR example on the next slides makes this concrete

30
Perceptron

• Example of Perceptron: Learn the operation XOR

x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 0
[Minsky and Papert, 1969]
There is no combination of weights w1, w2 and bias b for which a single perceptron
computes XOR (a short argument is given below)

31
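A quick way to see this (my reconstruction of the standard argument, with the step threshold at 0): writing s for the step function, XOR would require

s(w_1 \cdot 0 + w_2 \cdot 0 + b) = 0 \;\Rightarrow\; b \le 0
s(w_1 \cdot 0 + w_2 \cdot 1 + b) = 1 \;\Rightarrow\; w_2 + b > 0
s(w_1 \cdot 1 + w_2 \cdot 0 + b) = 1 \;\Rightarrow\; w_1 + b > 0
s(w_1 \cdot 1 + w_2 \cdot 1 + b) = 0 \;\Rightarrow\; w_1 + w_2 + b \le 0

Adding the two middle inequalities gives w_1 + w_2 + 2b > 0, i.e. w_1 + w_2 + b > -b \ge 0, which contradicts the last line.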
scikit-learn

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html

32
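A minimal usage sketch of that class, reusing the AND data from the earlier slide (parameters are left at their defaults; the expected output assumes convergence, which holds since AND is linearly separable):

import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])          # the AND operation

clf = Perceptron().fit(X, y)
print(clf.predict(X))               # expected: [0 0 0 1]
print(clf.coef_, clf.intercept_)    # learned weights and bias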
Minsky & Multi-layer perceptrons

• Interestingly, Minsky never said XOR is unsolvable by neural networks
• Only that XOR cannot be solved with 1-layer perceptrons
• Multi-layer perceptrons can solve XOR
• 9 years earlier, Minsky built such a multi-layer perceptron
• Any continuous function on a compact subset of ℝ^n can be
approximated to any arbitrary degree of precision by a feed-forward multi-
layer perceptron with a single hidden layer containing a finite number of
neurons.
[Cybenko 1989; Hornik 1991]
• However, how to train a multi-layer perceptron?
• Rosenblatt's algorithm is not applicable

33
Multilayer Perceptron

• 1980s to early 90s

[Diagram: a multilayer perceptron; the output of the hidden layer feeds the output of the network]

34
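In equations (a standard reconstruction; the slide used its own notation), for one hidden layer with activation σ:

\mathbf{h} = \sigma\left( W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \right) \qquad \text{(output of the hidden layer)}

\hat{\mathbf{y}} = \mathrm{softmax}\left( W^{(2)} \mathbf{h} + \mathbf{b}^{(2)} \right) \qquad \text{(output of the network)}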
Minsky & Multi-layer perceptrons

• Interestingly, Minsky never said XOR is unsolvable by neural networks
• Only that XOR cannot be solved with 1-layer perceptrons
• Multi-layer perceptrons can solve XOR
• 9 years earlier, Minsky built such a multi-layer perceptron
• Any continuous function on a compact subset of ℝ^n can be
approximated to any arbitrary degree of precision by a feed-forward multi-
layer perceptron with a single hidden layer containing a finite number of
neurons.
[Cybenko 1989; Hornik 1991]
• Problem: how to train a multi-layer perceptron?
• Rosenblatt's algorithm is not applicable
• It expects to know the ground truth z'_h for each hidden activation z_h
• For the output layer we have the ground-truth labels
• For intermediate hidden layers we don't

35
From a single layer to multiple layers

• 1 perceptron == 1 decision
• What about multiple decisions?
• E.g. digit classification
• Stack as many outputs as there are possible outcomes into a layer
• Neural Networks
• Use one layer as input to the next layer
• Add nonlinearities between layers
• Multi-layer perceptron (MLP)

36
Multilayer Perceptron

• XOR

37
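The slide's diagram did not survive the text export; below is one standard set of hand-picked weights (my choice, not necessarily the slide's) with which a two-layer perceptron computes XOR: the first hidden unit acts like OR, the second like AND, and the output fires when OR is true but AND is not.

import numpy as np

step = lambda z: (z > 0).astype(int)

W1 = np.array([[1.0, 1.0],      # hidden unit 1 ~ OR   (threshold 0.5)
               [1.0, 1.0]])     # hidden unit 2 ~ AND  (threshold 1.5)
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -1.0])      # output fires for "OR and not AND"
b2 = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array(x) + b1)
    y = step(W2 @ h + b2)
    print(x, int(y))            # -> 0, 1, 1, 0 (XOR)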
Multilayer Perceptron

• Multiple hidden layers

[Diagram: a multilayer perceptron with several hidden layers stacked between input and output]

38
Multilayer Perceptron

[Plots: mean square error on the training and validation sets, as a function of the number of
hidden units (left) and of the number of epochs (right)]

https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

39
Multilayer Perceptron

• Dropout
• At each iteration, set half the units (randomly) to 0.
• Avoid overfitting.
• Helps focus on informative features.

(Srivastava, Hinton, Krizhevsky, Sutskever & Salakhutdinov 2012)

40
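A tiny NumPy illustration of the idea (the inverted-dropout rescaling is my addition; the slide only describes zeroing half the units at random):

import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    # Randomly zero a fraction p of the units during training
    if not training:
        return h
    mask = rng.random(h.shape) > p        # keep each unit with probability 1 - p
    return h * mask / (1 - p)             # rescale so expected activations match test time

h = np.ones(8)                            # pretend hidden-layer activations
print(dropout(h))                         # roughly half the entries are 0, the rest are 2.0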
Multilayer Perceptron

• Architecture
• Start with one hidden layer.
• Stop adding layers when you overfit
• Try not to use more weights than training samples.
• Deeper networks usually perform better than shallower
• No rule about the architecture. Active research area.

41
Multilayer Perceptron - Summary

• Perceptrons learn linear discriminants.


• Learning is done by weight update. [next slides]
• Multi-layer perceptrons with one hidden layer are universal
approximators.
• Neural networks are hard to train, so caution must be applied.

42
scikit-learn

https://scikit-learn.org/stable/modules/neural_networks_supervised.html

43
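A minimal usage sketch of scikit-learn's multilayer perceptron (the dataset and architecture are arbitrary choices of mine):

from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# One hidden layer with 16 units, trained by (stochastic) gradient-based optimization
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
print(clf.score(X, y))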
Optimization

44
Optimization

45
Optimization

46
Backpropagation (Intuition)

• A simple example, e.g. x = -2, y = 5, z = -4

[Diagram: a small computational graph with a '+' node, evaluated at these values; the gradients
(-4, -4, 1) are annotated backwards along its edges]

47
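This looks like the classic f(x, y, z) = (x + y) · z example (an assumption on my part, based on the '+' node and the -4 gradients in the residue); worked out in code:

# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass: chain rule, starting from df/df = 1
df_dq = z          # d(q*z)/dq = z = -4
df_dz = q          # d(q*z)/dz = q = 3
df_dx = df_dq * 1  # d(x+y)/dx = 1  -> -4
df_dy = df_dq * 1  # d(x+y)/dy = 1  -> -4
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0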
Backpropagation

• A simple example

48
Backpropagation

• A simple example

49
Backpropagation

• Generic update rule:

• After each training instance, for each weight:

50
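The rule itself was an image; the standard form, with learning rate η and error E, is:

w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}}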
Backpropagation

• Backwards propagation of errors

51
Backpropagation

• Backwards propagation of errors

Forward

52
Backpropagation

• Backwards propagation of errors

Forward

53
Backpropagation

• Backwards propagation of errors

Forward

Backward

54
Backpropagation

• Algorithm

Epoch: when all the training points have been
seen once.
55
Backpropagation

• Escaping saturation
• Large weights => saturation
• Weight initialization
• Random initialization with values drawn from a normal distribution

• Weight decay = regularization
• E → E + weight-decay penalty (written out below)

56
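Written out, with λ the regularization strength (a standard reconstruction of the missing formula):

E \;\rightarrow\; E + \frac{\lambda}{2} \sum_{i,j} w_{ij}^2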
Backpropagation - Summary

• Neural Nets will be very large: impractical to write down the gradient
formula by hand for all parameters
• Backprop: recursive application of the chain rule along a computational
graph to compute the gradients of all inputs/parameters/intermediates
• Implementations maintain a graph structure, where the nodes
implement the forward() / backward() functions
• forward(): compute the result of an operation and save any intermediates
needed for gradient computation in memory
• backward(): apply the chain rule to compute the gradient of the loss
function with respect to the inputs (a toy node is sketched below)

57
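A toy version of such a node (a multiplication gate; the structure is illustrative, not any specific library's API):

class MultiplyNode:
    # One node of a computational graph: z = x * y

    def forward(self, x, y):
        # Compute the result and cache the inputs needed for the gradient
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule: dL/dx = dL/dz * y, dL/dy = dL/dz * x
        return dz * self.y, dz * self.x

node = MultiplyNode()
out = node.forward(3.0, -4.0)      # forward pass: -12.0
dx, dy = node.backward(1.0)        # backward pass with upstream gradient 1
print(out, dx, dy)                 # -12.0 -4.0 3.0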
Introduction to CNNs

• [LeCun et al. 1990]


• 4 hidden layers
• Shared weights

58
Introduction to CNNs

• [LeCun et al. 1990]


• Based on convolutions

59
Introduction to CNNs

• [LeCun et al. 1990]


• Based on convolutions

60
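To make "based on convolutions" concrete, a minimal 2D convolution (cross-correlation, as used in CNNs) in NumPy; the image and kernel values are arbitrary:

import numpy as np

def conv2d(image, kernel):
    # Valid 2D cross-correlation: slide the kernel over the image, no padding
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # arbitrary 2x2 filter
print(conv2d(image, kernel))                   # 3x3 feature map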
Types of (deep) neural networks

• Deep feed-forward (= multilayer perceptrons)


• Unsupervised networks
• Autoencoders/ variational autoencoders (VAE) - learn new
representation of the data
• deep belief networks (DBNs) - model the distribution of the data
but can add a supervised layer at the end
• generative adversarial networks (GANs) - learn to separate real
data from fake data they generate
• Convolutional neural networks (CNNs)
• For image/audio modeling
• Recurrent Neural Networks
• Nodes are fed information from the previous layer and also from
themselves (i.e. the past)
• Long short-term memory networks (LSTM) for sequence modeling
61
Introduction to CNNs

• Why now??
• Faster computers (GPUs)

• More training data

• Easy to use, supported by powerful libraries

62
Neural Networks packages

• Matlab
• Neural Network Toolbox
• Deep Learn Toolbox, Deep Belief Networks, …
• Python
• PyBrain, FANN
• Scikit-learn
• Deep learning: Pytorch, TensorFlow, …
• R
• Neuralnet
• nnet, deepnet, mxnet

63
Next Class

• Introduction to Deep Learning

64
Thank You!

65
