Chapter 11
Neural Network
Basic idea
• Combine input information in a complex & flexible neural
net “model”
• Model “coefficients” are continually tweaked in an
iterative process
• The network’s interim performance in classification and
prediction informs successive tweaks
Basic idea
[Figure: a "black box" with inputs X1, X2, X3 and output Y]

X1  X2  X3 | Y
 1   0   0 | 0
 1   0   1 | 1
 1   1   0 | 1
 1   1   1 | 1
 0   0   1 | 0
 0   1   0 | 0
 0   1   1 | 1
 0   0   0 | 0

Output Y is 1 if at least two of the three inputs are equal to 1.
Basic idea
[Figure: the same black box, now modeled as a single output node; input nodes X1, X2, X3 each connect to the output node with weight 0.3, and the output node applies threshold t = 0.4]

$Y = I(0.3\,X_1 + 0.3\,X_2 + 0.3\,X_3 - 0.4 > 0)$

where $I(z) = 1$ if $z$ is true, $0$ otherwise
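A minimal Python sketch of this single-node model, using the weights (0.3) and threshold (0.4) from the diagram, confirms that it reproduces the rule from the truth table:

```python
# Single-node "black box": each input has weight 0.3, threshold t = 0.4.
from itertools import product

def perceptron(x1, x2, x3, w=0.3, t=0.4):
    """Return 1 if the weighted sum of the inputs exceeds the threshold."""
    return 1 if w * x1 + w * x2 + w * x3 - t > 0 else 0

# Check every row of the truth table: the node outputs 1 exactly when
# at least two of the three inputs are equal to 1.
for x1, x2, x3 in product([0, 1], repeat=3):
    assert perceptron(x1, x2, x3) == (1 if x1 + x2 + x3 >= 2 else 0)
```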
Basic idea
• Model is an assembly of inter-connected nodes and weighted links
• The output node sums up each of its input values according to the weights of its links
• The output node's sum is compared against some threshold t
[Figure: input nodes X1, X2, X3 connected to a single output node Y by links with weights w1, w2, w3; the output node applies threshold t]
General structure
• Multiple layers
- Input layer (raw observations)
- Hidden layers
- Output layer
• Nodes
• Weights (like coefficients, subject to iterative
adjustment)
• Bias values (also like coefficients, and also subject
to iterative adjustment)
General structure
[Figure: a multi-layer perceptron with an input layer (x1, x2, x3, x4, x5), a hidden layer, and an output layer producing y. Each neuron i receives inputs I1, I2, I3 over links with weights wi1, wi2, wi3, forms the weighted sum Si, and passes it through an activation function g(Si), subject to a threshold t, to produce its output Oi.]
Training an ANN means learning the weights of the neurons.
Learning algorithm
• Initialize the weights (w0, w1, …, wk)
• Adjust the weights in such a way that the output of the ANN
is consistent with the class labels of the training examples
- Objective function: $E = \sum_i \left[ Y_i - f(w_i, X_i) \right]^2$
- Find the weights $w_i$ that minimize the above objective
function, e.g., via the backpropagation algorithm
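A minimal sketch of this idea for a single logistic node, minimizing the squared-error objective by plain gradient descent (the toy data and learning rate below are illustrative, not the book's example):

```python
import numpy as np

# Toy training data: two predictors, binary outcome (illustrative only).
X = np.array([[0.2, 0.9], [0.1, 0.1], [0.9, 0.8], [0.8, 0.2]])
Y = np.array([1.0, 0.0, 1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
w = rng.uniform(-0.05, 0.05, size=2)   # weights, small random start
theta = rng.uniform(-0.05, 0.05)       # bias, small random start
lr = 0.5                               # learning rate

for epoch in range(1000):
    y_hat = sigmoid(theta + X @ w)            # f(w, X_i) for every record
    delta = (Y - y_hat) * y_hat * (1 - y_hat)
    w += lr * X.T @ delta                     # move the weights downhill on E
    theta += lr * delta.sum()

print("squared error E:", np.sum((Y - sigmoid(theta + X @ w)) ** 2))
```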
Example: Fat & Salt Content to Predict
Consumer Acceptance of Cheese
Example: Data
Moving through the
Network
Input layer
• For input layer, input = output
- e.g., for record #1:
Fat input = output = 0.2
Salt input = output = 0.9
• Output of input layer = input into hidden
layer
Hidden layer
• In this example, it has 3 nodes
• Each node receives as input the output
of all input nodes
• Output of each hidden node is a
function of the weighted sum of inputs
$\text{output}_j = g\!\left(\theta_j + \sum_{i=1}^{p} w_{ij} x_i\right)$
Weights
• The bias values θ (theta) and the weights w are typically
initialized to random values in the range
-0.05 to +0.05
• Equivalent to a model with random
prediction (in other words, no predictive
value)
• These initial weights are used in the first
round of training
Initial pass of the network
Output of node 3, if g is a logistic function:

$\text{output}_j = g\!\left(\theta_j + \sum_{i=1}^{p} w_{ij} x_i\right)$

$\text{output}_3 = \dfrac{1}{1 + e^{-(\theta_3 + w_{13} x_1 + w_{23} x_2)}}$
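A small sketch of this calculation for hidden node 3, using record #1 (fat = 0.2, salt = 0.9); the bias and weights below are hypothetical small random starting values, not the book's:

```python
import numpy as np

def node_output(theta, weights, inputs):
    """Logistic function applied to the bias plus the weighted sum of inputs."""
    return 1.0 / (1.0 + np.exp(-(theta + np.dot(weights, inputs))))

x = [0.2, 0.9]            # record #1: fat = 0.2, salt = 0.9
theta_3 = -0.02           # hypothetical initial bias for node 3
w_3 = [0.05, 0.01]        # hypothetical initial weights into node 3

print(node_output(theta_3, w_3, x))   # roughly 0.5, as expected for near-zero weights
```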
Output layer
• The outputs of the last hidden layer become the inputs
for the output layer
• Uses the same function as above, i.e., a function g of
the weighted sum

$\text{output}_6 = \dfrac{1}{1 + e^{-(\theta_6 + w_{36}\,\text{output}_3 + w_{46}\,\text{output}_4 + w_{56}\,\text{output}_5)}}$
Output node
If the cutoff for class "1" is 0.5 and the output node's
value exceeds it, classify the record as "1"
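Putting the pieces together, a sketch of one full forward pass for a single record, ending with the 0.5 cutoff; all bias and weight values here are hypothetical illustrations:

```python
import numpy as np

def node_output(theta, weights, inputs):
    return 1.0 / (1.0 + np.exp(-(theta + np.dot(weights, inputs))))

x = np.array([0.2, 0.9])   # record #1: fat = 0.2, salt = 0.9

# Hidden layer: nodes 3, 4, 5, each fed by the two inputs (hypothetical weights).
hidden_thetas = [-0.02, 0.03, 0.01]
hidden_weights = [[0.05, 0.01], [-0.01, 0.03], [0.02, -0.04]]
hidden_out = [node_output(t, w, x) for t, w in zip(hidden_thetas, hidden_weights)]

# Output layer: node 6, fed by the three hidden-node outputs (hypothetical weights).
y_hat = node_output(0.015, [0.01, 0.05, 0.015], hidden_out)

# Classify with a cutoff of 0.5 on the output node's value.
print(y_hat, "->", 1 if y_hat > 0.5 else 0)
```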
Relation to linear regression
• A net with a single output node and no hidden layers,
where g is the identity function, takes the same form as
a linear regression model
$\hat{y} = \theta + \sum_{i=1}^{p} w_i x_i$
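For illustration, with the identity function as g and no hidden layer, the "network" prediction is literally the linear-regression formula (the numbers here are made up):

```python
import numpy as np

theta = 0.4                    # bias plays the role of the intercept
w = np.array([1.5, -2.0])      # link weights play the role of coefficients
x = np.array([0.2, 0.9])

y_hat = theta + np.dot(w, x)   # identity activation: same form as linear regression
```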
Training the model
Preprocessing Steps
• Scale variables to 0-1
• Categorical variables
- If equidistant categories, map to equidistant
interval points in 0-1 range
- Otherwise, create dummy variables
• Transform (e.g., log) skewed variables
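A minimal preprocessing sketch along these lines, using pandas; the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fat": [0.1, 3.2, 1.8],                 # numeric predictor
    "income": [20000, 85000, 40000],        # skewed numeric predictor
    "region": ["north", "south", "west"],   # nominal categorical predictor
})

# Transform the skewed variable first (log transform).
df["income"] = np.log1p(df["income"])

# Scale numeric variables to the 0-1 range (min-max scaling).
for col in ["fat", "income"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# Nominal categories are not equidistant, so create dummy variables.
df = pd.get_dummies(df, columns=["region"])
```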
Initial Pass Through Network
• Goal: Find weights that yield best
predictions
• The following process is repeated for each record:
- At each record, compare the prediction to the actual value
- The difference is the error for the output node
- The error is propagated back, distributed to all the
hidden nodes, and used to update their weights
Back Propagation (“back-prop”)
• Output from output node k: $\hat{y}_k$
• Error associated with that node:
$\text{err}_k = \hat{y}_k (1 - \hat{y}_k)(y_k - \hat{y}_k)$
Error is Used to Update Weights
$\theta_j^{new} = \theta_j^{old} + l\,(\text{err}_j)$
$w_{i,j}^{new} = w_{i,j}^{old} + l\,(\text{err}_j)$
l = constant between 0 and 1, reflects the “learning
rate” or “weight decay parameter”
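A sketch of one such update for the output node, mirroring the simplified rule above (err as defined on the previous slide; the starting values and learning rate are illustrative):

```python
def update_output_node(theta, weights, y_actual, y_hat, lr=0.1):
    """One back-prop update of the output node's bias and weights:
    new = old + l * err, with err for a logistic output node."""
    err = y_hat * (1 - y_hat) * (y_actual - y_hat)
    theta_new = theta + lr * err
    weights_new = [w + lr * err for w in weights]
    return theta_new, weights_new

# Example: the network predicted 0.43 for a record whose actual class is 1.
theta, weights = update_output_node(0.015, [0.01, 0.05, 0.015], y_actual=1, y_hat=0.43)
```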
Case Updating
• Weights are updated after each record
is run through the network
• Completion of all records through the
network is one epoch (also called
sweep or iteration)
• After one epoch is completed, return to
first record and repeat the process
Batch Updating
• All records in the training set are fed to
the network before updating takes
place
• In this case, the error used for updating
is the sum of all errors from all records
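A self-contained sketch contrasting the two regimes for a single logistic node (toy data; here each weight's update is also scaled by its input, a common refinement of the simplified rule shown earlier):

```python
import numpy as np

X = np.array([[0.2, 0.9], [0.1, 0.1], [0.9, 0.8], [0.8, 0.2]])  # toy predictors
Y = np.array([1.0, 0.0, 1.0, 0.0])                              # toy outcomes

predict = lambda theta, w, x: 1.0 / (1.0 + np.exp(-(theta + np.dot(w, x))))
err_term = lambda y, y_hat: y_hat * (1 - y_hat) * (y - y_hat)
lr = 0.5

# Case updating: weights change after every record; one full pass is one epoch.
theta, w = 0.0, np.zeros(2)
for epoch in range(100):
    for x, y in zip(X, Y):
        e = err_term(y, predict(theta, w, x))
        theta += lr * e
        w += lr * e * x

# Batch updating: errors from all records are summed before one update per epoch.
theta_b, w_b = 0.0, np.zeros(2)
for epoch in range(100):
    es = np.array([err_term(y, predict(theta_b, w_b, x)) for x, y in zip(X, Y)])
    theta_b += lr * es.sum()
    w_b += lr * (es[:, None] * X).sum(axis=0)
```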
Common Criteria to Stop Updating
• When weights change very little from
one iteration to the next
• When the misclassification rate reaches
a required threshold
• When a limit on runs is reached
XLMiner Output: Final Weights
Note: XLMiner uses two output nodes (P[1] and P[0]); diagrams show just one output
node (P[1])
Fat/Salt Example: Final Weights
XLMiner: Final Classifications
Avoiding Overfitting
• With sufficient hidden nodes and training
iterations, neural net can easily overfit
the data
• To avoid overfitting:
- Track error in validation data
- Limit iterations
- Limit complexity of network
User Inputs
Specify Network Architecture
• Number of hidden layers
- Most popular – one hidden layer
• Number of nodes in hidden layer(s)
- More nodes capture complexity, but increase
chances of overfit
• Number of output nodes
- For classification, one node per class (in binary
case can also use one)
- For numerical prediction use one
Network Architecture, cont.
• Learning Rate l
- Low values “downweight” the new
information from errors at each iteration
- This slows learning, but reduces tendency
to overfit to local structure
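As a concrete example of these user inputs, a scikit-learn multilayer perceptron can be configured with the same choices (one hidden layer of three nodes, a low learning rate, and a cap on iterations); this library is an assumption here, not something the chapter itself uses:

```python
from sklearn.neural_network import MLPClassifier

net = MLPClassifier(
    hidden_layer_sizes=(3,),    # one hidden layer with 3 nodes
    activation="logistic",      # logistic activation, as in the chapter's example
    learning_rate_init=0.1,     # low learning rate: slower but less prone to overfit
    max_iter=500,               # limit on training iterations
    early_stopping=True,        # track error on held-out validation data
)
# net.fit(X_train, y_train) would then train the weights by back-propagation.
```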
Advantages
• Good predictive ability
• Can capture complex relationships
• No need to specify a model
Disadvantages
• Considered a “black box” prediction machine,
with no insight into relationships between
predictors and outcome
• No variable-selection mechanism, so you have
to exercise care in selecting variables
• Heavy computational requirements if there are
many variables (additional variables
dramatically increase the number of weights to
calculate)
Summary
• Neural networks can be used for classification and prediction
• They can capture a very flexible/complicated relationship between the outcome and a set of predictors
• The network "learns" and updates its model iteratively as more data are fed into it
• Major danger: overfitting
• Requires large amounts of data
• Good predictive performance, yet "black box" in nature