Lecture Notes For Chapter 4 Artificial Neural Networks: Data Mining

The document discusses artificial neural networks (ANN) and deep learning. It introduces the basic idea of ANN as networks of simple processing units that can learn complex nonlinear functions. The simplest ANN is the perceptron, which learns linear decision boundaries. Multi-layer neural networks can learn nonlinear functions using techniques like backpropagation to calculate gradients and update weights. Recent trends in deep learning allow training very deep neural networks with many layers to learn complex hierarchical features from data.


Data Mining

Lecture Notes for Chapter 4

Artificial Neural Networks

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar



Artificial Neural Networks (ANN)

Basic Idea: A complex non-linear function can be learned as a composition of simple processing units

An ANN is a collection of simple processing units (nodes) that are connected by directed links (edges)
– Every node receives signals from incoming edges, performs computations, and transmits signals to outgoing edges
– Analogous to the human brain, where nodes are neurons and signals are electrical impulses
– The weight of an edge determines the strength of the connection between the nodes
– Simplest ANN: the perceptron (a single neuron)
Basic Architecture of Perceptron

[Figure: basic architecture of the perceptron, with an activation function applied at the output node]

Learns linear decision boundaries

Similar to logistic regression (the activation function is sign instead of sigmoid)
Perceptron Example

X1 X2 X3 Y
1 0 0 -1
1 0 1 1
1 1 0 1
1 1 1 1
0 0 1 -1
0 1 0 -1
0 1 1 1
0 0 0 -1

Output Y is 1 if at least two of the three inputs are equal to 1.

A perceptron that classifies this data correctly:

Y = sign( 0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 )

where sign(x) = +1 if x ≥ 0, and −1 if x < 0
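As a quick check (this sketch is not part of the original notes; the variable names are illustrative), the following NumPy code evaluates the perceptron above on the eight rows of the table and should reproduce the Y column:

import numpy as np

# Training examples from the table above: columns X1, X2, X3 and target class Y
X = np.array([[1,0,0], [1,0,1], [1,1,0], [1,1,1],
              [0,0,1], [0,1,0], [0,1,1], [0,0,0]])
Y = np.array([-1, 1, 1, 1, -1, -1, 1, -1])

# Perceptron model: Y_hat = sign(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4)
w = np.array([0.3, 0.3, 0.3])
b = -0.4
Y_hat = np.where(X @ w + b >= 0, 1, -1)

print(Y_hat)                      # [-1  1  1  1 -1 -1  1 -1]
print(np.array_equal(Y_hat, Y))   # True
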
Perceptron Learning Rule

Initialize the weights (w0, w1, …, wd)
Repeat
– For each training example (xi, yi)
 • Compute the predicted output ŷi(k) = sign( Σj wj(k) xij )
 • Update the weights: wj(k+1) = wj(k) + λ ( yi − ŷi(k) ) xij
Until stopping condition is met

k: iteration number; λ: learning rate (a runnable sketch of this loop follows below)
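A minimal sketch of this learning rule in NumPy (not from the original notes; the bias is handled by prepending a constant input x0 = 1, and sign(0) is taken as +1). Run on the data from the perceptron example with λ = 0.1, it should reproduce the per-epoch weights shown on the next slide and stop once a full pass makes no updates:

import numpy as np

def train_perceptron(X, y, lam=0.1, max_epochs=100):
    """Perceptron learning rule: w_j <- w_j + lam * (y_i - y_hat_i) * x_ij."""
    X = np.hstack([np.ones((len(X), 1)), X])    # prepend x0 = 1 so w[0] acts as the bias weight
    w = np.zeros(X.shape[1])                    # initialize all weights to zero
    for epoch in range(max_epochs):
        updated = False
        for xi, yi in zip(X, y):                # one pass over the training set = one epoch
            y_hat = 1 if xi @ w >= 0 else -1    # sign activation (sign(0) taken as +1)
            if y_hat != yi:
                w += lam * (yi - y_hat) * xi    # update only when the prediction is wrong
                updated = True
        print(f"epoch {epoch + 1}: w = {np.round(w, 2)}")
        if not updated:                         # stopping condition: no mistakes in a full pass
            break
    return w

# Data from the perceptron example (columns X1, X2, X3 and class Y)
X = np.array([[1,0,0], [1,0,1], [1,1,0], [1,1,1],
              [0,0,1], [0,1,0], [0,1,1], [0,0,0]])
y = np.array([-1, 1, 1, 1, -1, -1, 1, -1])
train_perceptron(X, y, lam=0.1)
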



Perceptron Learning Rule

Weight update formula: wj(k+1) = wj(k) + λ ( yi − ŷi(k) ) xij

Intuition:
– Update weight based on the error: e = y − ŷ
– If y = ŷ, e = 0: no update needed
– If y > ŷ, e = 2: weight must be increased so that ŷ will increase
– If y < ŷ, e = −2: weight must be decreased so that ŷ will decrease
Example of Perceptron Learning

  0.1
Weight updates over the first epoch (one update per training example):

Iteration   X1 X2 X3  Y     w0    w1    w2    w3
    0       (initial)        0     0     0     0
    1        1  0  0  -1   -0.2  -0.2    0     0
    2        1  0  1   1     0     0     0    0.2
    3        1  1  0   1     0     0     0    0.2
    4        1  1  1   1     0     0     0    0.2
    5        0  0  1  -1   -0.2    0     0     0
    6        0  1  0  -1   -0.2    0     0     0
    7        0  1  1   1     0     0    0.2   0.2
    8        0  0  0  -1   -0.2    0    0.2   0.2

Weight updates over all epochs (weights at the end of each epoch):

Epoch    w0    w1    w2    w3
  0       0     0     0     0
  1     -0.2    0    0.2   0.2
  2     -0.2    0    0.4   0.2
  3     -0.4    0    0.4   0.2
  4     -0.4   0.2   0.4   0.4
  5     -0.6   0.2   0.4   0.2
  6     -0.6   0.4   0.4   0.2




Perceptron Learning

Since y is a linear combination of the input variables, the decision boundary is linear

For nonlinearly separable problems, the perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly



Nonlinearly Separable Data

XOR Data

y = x1 ⊕ x2 (XOR)
x1 x2 y
0 0 -1
1 0 1
0 1 1
1 1 -1



Multi-layer Neural Network

More than one layer of computing nodes: one or more hidden layers between the input and output layers

Every node in a hidden layer operates on activations from the preceding layer and transmits activations forward to nodes of the next layer

Also referred to as “feedforward neural networks”



Multi-layer Neural Network

Multi-layer neural networks with at least one hidden layer can solve any type of classification task involving nonlinear decision surfaces, such as the XOR data shown earlier (a hand-built example follows below)
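To make this concrete, here is a small sketch (not from the notes; the weights are chosen by hand rather than learned) of a two-layer network of threshold units that computes XOR: one hidden unit detects "x1 OR x2", the other detects "x1 AND x2", and the output fires only when the first is on and the second is off:

import numpy as np

def step(z):
    # 0/1 threshold used for the hidden units: 1 if z >= 0, else 0
    return (z >= 0).astype(float)

# XOR data (x1, x2) with class labels in {-1, +1}
X = np.array([[0,0], [1,0], [0,1], [1,1]])
y = np.array([-1, 1, 1, -1])

# Hidden layer: h1 fires for "x1 OR x2", h2 fires for "x1 AND x2"
W1 = np.array([[1.0, 1.0],     # weights into h1
               [1.0, 1.0]])    # weights into h2
b1 = np.array([-0.5, -1.5])    # thresholds: OR needs a sum above 0.5, AND above 1.5
H = step(X @ W1.T + b1)

# Output layer: predict +1 exactly when h1 = 1 and h2 = 0 (exactly one input is 1)
w2 = np.array([1.0, -1.0])
b2 = -0.5
y_hat = np.where(H @ w2 + b2 >= 0, 1, -1)

print(y_hat)                      # [-1  1  1 -1]
print(np.array_equal(y_hat, y))   # True
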



Why Multiple Hidden Layers?

Activations at hidden layers can be viewed as features extracted as functions of the inputs
Every hidden layer represents a level of abstraction
– Complex features are compositions of simpler features
The number of layers is known as the depth of the ANN
– Deeper networks express a complex hierarchy of features



Multi-Layer Network Architecture


[Figure: multi-layer network architecture – the activation value at node i of layer l is obtained by applying an activation function to a linear predictor of the previous layer's activations]
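A minimal sketch of this forward computation (not from the notes; sigmoid activations and randomly initialized parameters are assumed, and the function and variable names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # At each layer l: linear predictor z^l = W^l a^(l-1) + b^l, activation a^l = f(z^l)
    a = x                          # activations of the input layer
    for W, b in zip(weights, biases):
        z = W @ a + b              # linear predictor at this layer
        a = sigmoid(z)             # activation value at every node of the layer
    return a                       # activations of the output layer

# Example: 3 inputs -> 4 hidden nodes -> 1 output, parameters drawn at random
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
biases = [np.zeros(4), np.zeros(1)]
print(forward(np.array([1.0, 0.0, 1.0]), weights, biases))
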



Activation Functions
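The original slide shows plots of common activation functions; as a stand-in, here is a sketch defining several standard choices (sign, sigmoid, tanh, ReLU):

import numpy as np

def sign_fn(z):     # threshold unit used by the perceptron
    return np.where(z >= 0, 1.0, -1.0)

def sigmoid(z):     # squashes z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):        # squashes z into (-1, 1)
    return np.tanh(z)

def relu(z):        # rectified linear unit: max(0, z)
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
for f in (sign_fn, sigmoid, tanh, relu):
    print(f.__name__, np.round(f(z), 2))
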



Learning Multi-layer Neural Network

Can we apply the perceptron learning rule to each node, including hidden nodes?
– The perceptron learning rule computes the error term e = y − ŷ and updates weights accordingly
 • Problem: how to determine the true value of y for hidden nodes?
– Approximate the error in hidden nodes by the error in the output nodes
 • Problems:
– Not clear how an adjustment in the hidden nodes affects the overall error
– No guarantee of convergence to an optimal solution



Gradient Descent

Loss function to measure errors across all training points
– Squared loss: E(w) = Σ_k ( y_k − ŷ_k )²

Gradient descent: update parameters in the direction of “maximum descent” in the loss function across all points
– w_j ← w_j − λ ∂E(w)/∂w_j, where λ is the learning rate

Stochastic gradient descent (SGD): update the weights for every instance (mini-batch SGD: update over mini-batches of instances); a sketch follows below
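A minimal sketch of batch gradient descent on the squared loss (not from the notes; shown for a simple linear model y_hat = w · x on synthetic data, but the same update applies to neural-network weights):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                         # targets generated by a known linear model

w = np.zeros(3)                        # parameters to be learned
lam = 0.1                              # learning rate (lambda in the slides)
for epoch in range(200):
    y_hat = X @ w
    grad = -2 * X.T @ (y - y_hat) / len(X)   # gradient of E(w) = (1/N) * sum_k (y_k - y_hat_k)^2
    w -= lam * grad                          # step in the direction of steepest descent
print(np.round(w, 3))                        # approaches [ 1.  -2.   0.5]

For SGD, the same update would be applied after each individual instance (or each mini-batch) rather than after a full pass over the data.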



Computing Gradients
Goal: compute the gradient of the loss E with respect to every weight w in the network, ∂E/∂w

Using the chain rule of differentiation (on a single instance), the gradient with respect to a weight at layer l factors into three terms: the gradient of the loss with respect to the node's activation a, the gradient of a with respect to the linear predictor z, and the gradient of z with respect to the weight

For the sigmoid activation function: ∂a/∂z = a (1 − a)  (a numerical check is sketched below)

How can we compute the gradient of the loss with respect to the activations at every layer?
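A small numerical check of the sigmoid term (a sketch, not from the notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
a = sigmoid(z)
analytic = a * (1 - a)                                     # derivative from the formula above
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central-difference estimate
print(np.allclose(analytic, numeric, atol=1e-6))           # True
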


Backpropagation Algorithm

At the output layer L: the gradient of the loss with respect to the output activations follows directly from the loss function (e.g., for squared loss it is proportional to ŷ − y)

At a hidden layer (using the chain rule): the gradient with respect to an activation at layer l is a weighted sum of the gradients at layer l + 1
– Gradients at layer l can be computed using gradients at layer l + 1
– Start from layer L and “backpropagate” gradients to all previous layers

Use gradient descent to update the weights at every epoch
For the next epoch, use the updated weights to compute the loss function and its gradient
Iterate until convergence (the loss does not change); a code sketch follows below
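A compact sketch of the whole procedure (not from the notes: sigmoid units, squared loss, and batch gradient descent are assumed; the layer sizes, learning rate, and XOR usage example are illustrative, and whether training escapes a local minimum depends on the random initialization):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, sizes=(2, 4, 1), lam=1.0, epochs=10000):
    # Backpropagation with sigmoid activations and squared loss (batch gradient descent)
    rng = np.random.default_rng(0)
    W = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    b = [np.zeros((m, 1)) for m in sizes[1:]]
    for _ in range(epochs):
        # Forward pass: store the activations at every layer
        a = [X.T]                                        # a[0] holds the inputs
        for Wl, bl in zip(W, b):
            a.append(sigmoid(Wl @ a[-1] + bl))
        # Backward pass: start from the output layer L and backpropagate gradients
        delta = (a[-1] - y.T) * a[-1] * (1 - a[-1])      # dE/dz at the output layer
        for l in range(len(W) - 1, -1, -1):
            gW = delta @ a[l].T / X.shape[0]             # gradient w.r.t. the weights into layer l+1
            gb = delta.mean(axis=1, keepdims=True)
            if l > 0:                                    # gradients at layer l from those at layer l+1
                delta = (W[l].T @ delta) * a[l] * (1 - a[l])
            W[l] -= lam * gW                             # gradient-descent update
            b[l] -= lam * gb
    return a[-1]

# Usage sketch: learn XOR (labels recoded to {0, 1} to match the sigmoid output)
X = np.array([[0,0], [1,0], [0,1], [1,1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
print(np.round(train(X, y).ravel(), 2))   # typically approaches [0, 1, 1, 0]
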
Design Issues in ANN

Number of nodes in the input layer
– One input node per binary/continuous attribute
– k or log2 k nodes for each categorical attribute with k values (an encoding sketch follows below)
Number of nodes in the output layer
– One output node for a binary class problem
– k or log2 k nodes for a k-class problem
Number of hidden layers and nodes per layer
Initial weights and biases
Learning rate, max. number of epochs, mini-batch size for mini-batch SGD, …
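As an illustration of the k vs. log2 k choice for a categorical attribute (a sketch; the attribute and its values are made up):

import numpy as np

values = ["red", "green", "blue", "yellow"]          # a categorical attribute with k = 4 values
k = len(values)

# Option 1: k input nodes (one-hot encoding) -> 4 nodes
one_hot = {v: np.eye(k)[i] for i, v in enumerate(values)}

# Option 2: ceil(log2 k) input nodes (binary code) -> 2 nodes
n_bits = int(np.ceil(np.log2(k)))
binary = {v: np.array([(i >> bit) & 1 for bit in range(n_bits)], dtype=float)
          for i, v in enumerate(values)}

print(one_hot["blue"])   # [0. 0. 1. 0.]
print(binary["blue"])    # [0. 1.]
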



Characteristics of ANN

Multilayer ANNs are universal approximators but could suffer from overfitting if the network is too large
Gradient descent may converge to local minimum
Model building can be very time consuming, but testing
can be very fast
Can handle redundant and irrelevant attributes because
weights are automatically learnt for all attributes
Sensitive to noise in training data
Difficult to handle missing attributes



Deep Learning Trends

Training deep neural networks (with more than 5-10 layers) has become feasible only recently, thanks to:
– Faster computing resources (GPU)
– Larger labeled training sets
– Algorithmic Improvements in Deep Learning
Recent Trends:
– Specialized ANN Architectures:
Convolutional Neural Networks (for image data)
Recurrent Neural Networks (for sequence data)
Residual Networks (with skip connections)
– Unsupervised Models: Autoencoders
– Generative Models: Generative Adversarial Networks
Vanishing Gradient Problem

The sigmoid activation function easily saturates (shows near-zero gradient with respect to z) when z is too large or too small
This leads to small (or zero) gradients of the squared loss with respect to the weights, especially at hidden layers, resulting in slow (or no) learning (see the sketch below)
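A small sketch of the saturation effect (not from the notes; the sigmoid derivative a·(1−a) is the factor that multiplies every gradient flowing through the node):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0, 10.0, -10.0]:
    a = sigmoid(z)
    print(f"z = {z:6.1f}   sigmoid(z) = {a:.5f}   gradient a*(1-a) = {a * (1 - a):.2e}")
# The gradient falls from 0.25 at z = 0 to roughly 4.5e-05 at |z| = 10.
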



Handling Vanishing Gradient Problem

Use of the cross-entropy loss function: E = − Σ_k [ y_k log ŷ_k + (1 − y_k) log(1 − ŷ_k) ]  (for targets y in {0, 1})

Use of Rectified Linear Unit (ReLU) activations: ReLU(z) = max(0, z)
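A sketch contrasting the two remedies (not from the notes; it assumes a single sigmoid output node with target y = 1 and compares gradients with respect to its linear predictor z):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.5, 10.0])
y, y_hat = 1.0, sigmoid(z)

relu_grad = (z > 0).astype(float)               # dReLU/dz: constant 1 for z > 0, so no saturation
sq_grad = (y_hat - y) * y_hat * (1 - y_hat)     # dE/dz for squared loss (up to a constant factor)
ce_grad = y_hat - y                             # dE/dz for cross-entropy loss

print("ReLU gradient:      ", relu_grad)
print("squared-loss dE/dz: ", np.round(sq_grad, 6))   # nearly zero where the sigmoid saturates
print("cross-entropy dE/dz:", np.round(ce_grad, 6))   # stays large when y_hat is far from y
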
