Lec 2: Perceptron & MLP
INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Tianyi Zhou
8/28/23
University of Maryland
Some slides adapted from Song & Russell
Seven Components of this course
[Diagram: an Agent interacts with Human users and the World through Action/Acting and Prediction, built on Neural Networks, Probabilistic Reasoning, Language Models, and Embodied & Multi-modal AI.]
Search & Planning
• Uninformed Search
• Informed Search and A*
• Constraint Satisfaction
• Adversarial Search and Games
• Planning
Perception
• Detection
• Segmentation
• Captioning and VQA
• 3D Vision
• Video Understanding
Acting
• Markov Decision Process (MDP)
• Reinforcement Learning (RL)
• Imitation Learning
Language Models
• Pre-training
• Finetuning
• Prompting
• Ensemble
• Understanding and Reasoning
Embodied & Multi-modal AI
• Vision-Language Models
• LLM Agent
• Embodied Agent
• VLM/LLM + RL
Probabilistic Reasoning & Bayesian Networks
• Bayesian Networks
• Probabilistic Inference
• Sampling
Course websites
• ELMS: https://umd.instructure.com/courses/1350035
– Announcements, lecture notes/recordings, syllabus, class schedule, final grades
• Piazza: https://piazza.com/class/llup7xc8lm44w4
– Any and all offline questions
– Use Piazza before emailing
– Option to send a private message (also available in ELMS)
• Zoom
– Final presentation and some office hours
Course materials
• Textbooks
– Russell & Norvig: https://aima.cs.berkeley.edu/ (primary)
– Sutton & Barto: http://www.incompleteideas.net/book/the-book-2nd.html
– Goodfellow, Bengio, & Courville: https://www.deeplearningbook.org/
– Murphy: https://probml.github.io/pml-book/
• Other AI Courses
– Berkeley CS188: https://inst.eecs.berkeley.edu/~cs188/archives
– Stanford CS221: https://stanford-cs221.github.io/spring2023/
– MIT Deep Learning: https://introtodeeplearning.com/
– Harvard CS50: https://cs50.harvard.edu/ai/2023/
• Online Materials:
– https://github.com/owainlewis/awesome-artificial-intelligence
Coursework and grading
• Four Assignments (no collaboration allowed, 2-3 weeks per assignment) (60%)
– Assignment-1: Neural Networks and Optimization
– Assignment-2: Search and Planning Algorithms
– Assignment-3: MDP, RL, and Imitation Learning
– Assignment-4: Perception and Language Modeling
• One Group Project (groups of 5 or fewer; members may receive different credit) (60%)
– Mid-term Presentation (problem introduction + preliminary results + paper presentation) (15%)
– Weekly Reports (10 reports, starting in the 3rd week, excluding the mid-term and final weeks) (20%)
– Final Technical Report (25%)
• One in-class closed-book Exam (Oct 30th) (20%)
– Topics mainly included in the Four Assignments
• Attendance (in-class sign-in) (20%)
• Final grade = max {Assignments, Project} + Exam + Attendance
Late/absence & re-grading policies
• If you submit by due date – no penalty
• Up to one hour late – 5% penalty
• Up to 24 hours late – 20% penalty
• 24-48 hours – 50% penalty
• Beyond 48 hours – 100% penalty
• If you must miss a class, please notify me via email 24 hours ahead of the class.
• If you need an exam or presentation rescheduled for medical or health reasons, please notify me via email at least 24 hours before the exam/presentation (except in an emergency).
Plan
• Today
– Neural networks basics
• http://neuralnetworksanddeeplearning.com/
– Ch 1, 2, and 6
• Russell and Norvig
– Ch 18.7
– https://jeremykun.com/2012/12/09/neural-networks-and-backpropagation/
• Bias term b: y = b + \sum_i w_i x_i
• How to determine the weights?
• How to determine the activation function?
Perceptron (Rosenblatt, 1958)
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feed into a single unit that produces the output.]

output = 0 if \sum_i w_i x_i + b \le 0,  and 1 if \sum_i w_i x_i + b > 0
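A minimal sketch of this threshold rule in Python; the function name and the example inputs/weights below are illustrative assumptions, not taken from the slides:

```python
def perceptron(x, w, b):
    """Threshold unit: returns 1 if the weighted sum plus bias is positive, else 0."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else 0

# Illustrative weights only: three binary inputs, one binary output.
print(perceptron([1, 0, 1], w=[0.5, -0.3, 0.2], b=-0.4))  # 1, since 0.5 + 0.2 - 0.4 > 0
```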
Perceptron as a Threshold function
[Diagram: a constant input 1 with weight w0 (the bias) and inputs x1, x2 with weights w1, w2 are summed and passed through a threshold function f(x) to produce the output; equivalently, inputs x1, x2 with weights w1, w2 and a bias b.]

x1  x2  output
0   0   1
0   1   1
1   0   1
1   1   0
Exercise
[Diagram: perceptron with constant input x0 = 1, inputs x1 and x2, weights w1 and w2, and bias b.]

x1  x2  output
0   0   1
0   1   1
1   0   1
1   1   0

• Design a perceptron that functions as a NAND gate → what should w1, w2, b be?

output = 0 if w1 x1 + w2 x2 + b \le 0,  and 1 if w1 x1 + w2 x2 + b > 0
w1 = ?,  w2 = ?,  b = ?
Perceptron and logical functions
[Diagram: perceptron with constant input x0 = 1 and weights w1 = -2, w2 = -2, bias b = 3 (the weight on x0).]
• Output = 0 only when both inputs are 1; otherwise, output = 1.
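As a quick check, a short Python sketch using the weights from this slide (w1 = w2 = -2, b = 3); the helper function is the same illustrative perceptron sketch as above:

```python
def perceptron(x, w, b):
    """Threshold unit: 1 if the weighted sum plus bias is positive, else 0."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else 0

# NAND with w1 = w2 = -2, b = 3 (values from the slide).
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron([x1, x2], w=[-2, -2], b=3))
# Prints 1 for (0,0), (0,1), (1,0) and 0 for (1,1) -- the NAND truth table.
```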
How to Train a Perceptron?
• Just like the perceptron case, we will assume that we know the inputs and desired outputs.
• Our goal will be to find the weights so that the output of the neuron matches the desired outputs.
• (1) If output < 0 but y = P (positive), add x to w; (2) if output >= 0 but y = N (negative), subtract x from w.
• Simple rule: w = w + y x (sketched below)
• The number of errors made by the Perceptron algorithm is upper bounded: http://www.cs.columbia.edu/~mcollins/courses/6998-2012/notes/perc.converge.pdf,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html
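A minimal sketch of this mistake-driven update in Python, assuming labels y in {+1, -1} and a bias folded into the weight vector via a constant input of 1; the function name, epoch count, and toy data are illustrative:

```python
def train_perceptron(data, n_epochs=20):
    """Perceptron algorithm: on each mistake, update w <- w + y * x."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(n_epochs):
        for x, y in data:                      # y is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:                 # mistake (a score of exactly 0 counts as a mistake here)
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# NAND as a linearly separable problem; the last input is the constant 1 (bias).
data = [([0, 0, 1], +1), ([0, 1, 1], +1), ([1, 0, 1], +1), ([1, 1, 1], -1)]
w = train_perceptron(data)
print(w)  # a separating weight vector with signs (-, -, +)
```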
Beyond Logical Functions
[Diagram: a single unit with inputs x0, x1 and weights w0, w1 producing an output.]
• Perceptron – binary inputs and binary outputs.
• However, we are interested in non-binary inputs and non-binary outputs, like pixel values.
• Neurons – binary and/or real-valued inputs and binary and/or real-valued outputs.
Perceptron as a Threshold function
[Diagram: inputs 1, x1, x2 with weights w0, w1, w2 feed a summation node followed by f(x), which produces the output; the hard threshold f can be replaced by a sigmoid.]
1. Compute the weighted sum z = \sum_i w_i x_i
2. Apply the sigmoid function \sigma(z) = 1 / (1 + e^{-z})
Sigmoid Neuron
[Diagram: inputs 1, x1, x2 with weights w0, w1, w2 feed a summation node followed by the sigmoid, which produces the output.]
1. Compute the weighted sum z = \sum_i w_i x_i
2. Apply the sigmoid function \sigma(z) = 1 / (1 + e^{-z})
Activation functions
• Thresholding and filter function
• Increase nonlinearity and expressiveness
• Partition of the input space
• Many choices are available and critical to performance (a few common ones are sketched below)
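For concreteness, a few standard activation functions in Python; the specific list (step, sigmoid, tanh, ReLU) is a common set of examples rather than one given on this slide:

```python
import math

def step(z):      # hard threshold, as in the original perceptron
    return 1.0 if z > 0 else 0.0

def sigmoid(z):   # smooth, squashes to (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):      # squashes to (-1, 1)
    return math.tanh(z)

def relu(z):      # rectified linear unit, common in deep networks
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, step(z), round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```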
Neural connections in animal brains
• Connectome (a map of neural connections) in the brain of Caenorhabditis elegans, a <1 mm worm with 302 neurons.
• Much more complicated in human brains.
• The architecture of neural networks is important (multi-layer, convolution, attention, recurrent, graph).
Multi-layer perceptron - MLP
[Diagram: inputs connected through one or more hidden layers to the output; weights w1, w2, w3, w4, ..., w13, w14 label the connections.]
Multi-layer perceptron
[Diagram: inputs connected through one or more hidden layers to the output.]
• The output of each neuron is the weighted sum of all its inputs passed through an activation function.
• The final output is therefore a composition of many, many such functions (a minimal forward-pass sketch follows below).
• http://playground.tensorflow.org/
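A minimal forward-pass sketch of a two-layer MLP in Python; the layer sizes, random weights, and use of numpy are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP: each layer is a weighted sum followed by an activation."""
    h = sigmoid(W1 @ x + b1)      # hidden layer
    y = sigmoid(W2 @ h + b2)      # output layer
    return y

rng = np.random.default_rng(0)
x = np.array([0.6, 0.0, -1.0])                     # 3 inputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)      # 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)      # 1 output
print(mlp_forward(x, W1, b1, W2, b2))
```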
Neural networks are function approximators
• Use (x, y) data to learn a neural network f(x) so that y ≈ f(x)

Prediction/Regression: the output is a continuous variable, e.g.
f([is weekend?, is holiday?, weather, area income, ...]) → sales forecast

Classification: the output is a discrete label, e.g.
f([is weekend?, is holiday?, weather, area income, ...]) → is cat?

[Diagram: a network with one or more hidden layers.]
Training

Training inputs and output:
x1    x2    x3    y
0.6   0     -1    10
0.4   1.2   3.1   21
1.7   0.6   -0.3  2.1
1     1     0     0.1

• Given training inputs and outputs, find weights so that the network output matches the training outputs. How?
• What should w1, w2, b be?
• Want f(x) = y; how to measure their difference?
• Minimize a loss function E(w) = \sum_{x,y} ||y - f(x)||^2 (a small numeric sketch follows below)

Example (x0 = 1):
x1  x2  y
0   0   1
0   1   1
1   0   1
1   1   0

f(x) = 0 if w1 x1 + w2 x2 + b \le 0,  and 1 if w1 x1 + w2 x2 + b > 0
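A small Python sketch of evaluating this squared-error loss on the NAND table for candidate settings of (w1, w2, b); the first candidate is an illustrative guess, the second uses the NAND weights from earlier:

```python
def f(x1, x2, w1, w2, b):
    """Thresholded perceptron output."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def loss(w1, w2, b, data):
    """Sum of squared differences between desired outputs y and f(x)."""
    return sum((y - f(x1, x2, w1, w2, b)) ** 2 for x1, x2, y in data)

nand = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(loss(1, 1, 1, nand))    # illustrative guess: always outputs 1, so loss = 1
print(loss(-2, -2, 3, nand))  # the NAND weights from earlier: loss = 0
```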
Optimization of loss function
[Figure: loss surface E(w) plotted over weights w1 and w2.]
• Different optimization algorithms
• The loss landscape of a DNN is complicated: https://tinyurl.com/2fe26wsn
General Recipe for optimization

Training inputs and output:
x1    x2    x3    y
0.6   0     -1    10
0.4   1.2   3.1   21
1.7   0.6   -0.3  2.1
1     1     0     0.1

• Start with some initial weights w
• Compute f(x) based on these weights
• Check if f(x) matches y – how? Compute E(w)
• Adjust the weights – how?
• Rinse and repeat

[Diagram: MLP with labeled weights w1, ..., w4, ..., w13, w14.]
How to adjust weights?
• Let w_old = [w1, w2, w3, ...]^T be the current set of weights
• Find \nabla E(w_old) = (\partial E / \partial w_0, \partial E / \partial w_1, \ldots)
• -\nabla E(w_old) is a vector pointing in the direction of steepest decrease of E at w_old
• Update the weights so that you move in the direction of this vector (i.e., towards a local minimum); a numerical sketch follows below:

w_new = w_old - \alpha \nabla E(w_old)
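A minimal gradient-descent sketch in Python, using a finite-difference approximation of the gradient for illustration; the quadratic loss function here is an assumption made only for the example (real networks use backpropagation or automatic differentiation instead):

```python
def E(w):
    """Illustrative loss: a simple quadratic with its minimum at (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def numerical_grad(f, w, eps=1e-6):
    """Finite-difference estimate of (dE/dw0, dE/dw1, ...)."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w);  w_plus[i] += eps
        w_minus = list(w); w_minus[i] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad

w = [0.0, 0.0]
alpha = 0.1
for _ in range(100):                         # w_new = w_old - alpha * grad E(w_old)
    g = numerical_grad(E, w)
    w = [wi - alpha * gi for wi, gi in zip(w, g)]
print([round(wi, 3) for wi in w])            # approaches (3, -1)
```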
General Recipe

Training inputs and output:
x1    x2    x3    y
0.6   0     -1    10
0.4   1.2   3.1   21
1.7   0.6   -0.3  2.1
1     1     0     0.1

• Start with some initial weights
• Compute f(x) based on these weights
• Check if f(x) matches y – how? Compute E(w)
• Adjust the weights – how? Gradient descent
• Rinse and repeat

[Diagram: MLP with labeled weights.]
Gradient Descent
[Figure: loss surface over weights w1 and w2.]
• We still need to compute \nabla E(w_old) = (\partial E / \partial w_0, \partial E / \partial w_1, \ldots)
• I'll show how to do this for a single sigmoid neuron
• Then, we will see how to do this for a neural network – backpropagation
• Most libraries implement automatic differentiation (see the example below)
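For example, a quick sketch of automatic differentiation; this assumes PyTorch is available (the slides mention automatic differentiation in general, not a particular library), and the toy inputs, target, and weights are illustrative:

```python
import torch

# A toy loss E(w) = ||y - sigmoid(w . x)||^2 for a single sigmoid neuron.
x = torch.tensor([1.0, 0.5, -0.3])
y = torch.tensor(1.0)
w = torch.tensor([0.1, -0.2, 0.3], requires_grad=True)

loss = (y - torch.sigmoid(w @ x)) ** 2
loss.backward()          # autograd computes dE/dw for us
print(w.grad)            # the gradient vector (dE/dw0, dE/dw1, dE/dw2)
```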
Sigmoid Neuron
[Diagram: inputs 1, x1, x2 with weights w0, w1, w2 feed a summation node followed by the sigmoid, producing the output f_w(x).]

f_w(x) = \sigma(z(x))
\sigma(z) = 1 / (1 + e^{-z})
z(x) = \sum_i w_i x_i
E(w) = (1/2) \sum_j ||y^j - f_w(x^j)||^2

Need to compute \nabla E(w_old) = (\partial E / \partial w_0, \partial E / \partial w_1, \ldots)
Finding the weights

f_w(x) = \sigma(z(x)),   \sigma(z) = 1 / (1 + e^{-z}),   z(x) = \sum_i w_i x_i
E(w) = (1/2) \sum_j ||y^j - f_w(x^j)||^2

\partial E / \partial w_i = ?
Finding the weights

f_w(x) = \sigma(z(x)),   \sigma(z) = 1 / (1 + e^{-z}),   z(x) = \sum_i w_i x_i
E(w) = (1/2) \sum_j ||y^j - f_w(x^j)||^2

Use the chain rule:
\partial E / \partial w_i = (\partial E / \partial \sigma) \cdot (\partial \sigma / \partial z) \cdot (\partial z / \partial w_i)
\partial E / \partial w_i = -\sum_j (y^j - \sigma(z(x^j))) \cdot (\partial \sigma / \partial z) \cdot (\partial z / \partial w_i)

d\sigma / dz = \sigma(z)(1 - \sigma(z)),   \partial z / \partial w_i = x_i^j
(x_i^j is the i-th component of the vector x^j)
Finding the weights

f_w(x) = \sigma(z(x)),   \sigma(z) = 1 / (1 + e^{-z}),   z(x) = \sum_i w_i x_i
E(w) = (1/2) \sum_j ||y^j - f_w(x^j)||^2

\partial E / \partial w_i = -\sum_j (y^j - \sigma(z(x^j))) \cdot (\partial \sigma / \partial z) \cdot (\partial z / \partial w_i)

All three \sigma terms above are the same quantity – the output of the neuron, f_w(x^j).

d\sigma / dz = \sigma(z)(1 - \sigma(z)),   \partial z / \partial w_i = x_i^j
(x_i^j is the i-th component of the vector x^j)
Rewriting the equations

\partial E / \partial w_i = -\sum_j (y^j - \sigma(z(x^j))) \cdot (d\sigma / dz) \cdot (\partial z / \partial w_i)
d\sigma / dz = \sigma(z)(1 - \sigma(z)),   \partial z / \partial w_i = x_i^j
\nabla E(w) = (\partial E / \partial w_0, \partial E / \partial w_1, \ldots)

Let o^j \triangleq \sigma(z(x^j)), the output of the neuron for input x^j. Then

\partial E / \partial w_i = -\sum_j (y^j - o^j) \, o^j (1 - o^j) \, x_i^j

Update rule (translated into code below):

w_i' = w_i - \eta \, \partial E / \partial w_i
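A direct translation of these formulas into Python for a single sigmoid neuron; the function names, learning rate, and toy data are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_E(w, data):
    """dE/dw_i = -sum_j (y^j - o^j) * o^j * (1 - o^j) * x_i^j."""
    grad = [0.0] * len(w)
    for x, y in data:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))   # output o^j
        for i, xi in enumerate(x):
            grad[i] += -(y - o) * o * (1.0 - o) * xi
    return grad

# Toy data: the last input is the constant 1, so w[-1] plays the role of the bias.
data = [([0, 0, 1], 1), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 0)]
w, eta = [0.0, 0.0, 0.0], 1.0
g = grad_E(w, data)
w = [wi - eta * gi for wi, gi in zip(w, g)]   # w_i' = w_i - eta * dE/dw_i
print([round(wi, 3) for wi in w])
```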
Gradient Descent

• The change in w_i is proportional to the gradient of the loss with respect to w_i:

\nabla E(w) = (\partial E / \partial w_0, \partial E / \partial w_1, \ldots)

Let o^j \triangleq \sigma(z(x^j)). Then

\partial E / \partial w_i = -\sum_j (y^j - o^j) \, o^j (1 - o^j) \, x_i^j

– Update each weight w_i based on the gradient:

w_i \leftarrow w_i - \eta \, \partial E / \partial w_i
Neural Network - MLP
[Diagram: MLP with a hidden layer between the inputs and the output.]

\partial E / \partial w_i = -\sum_j (y^j - o^j) \, o^j (1 - o^j) \, x_i^j

– Update each weight w_i based on the gradient:

w_i \leftarrow w_i - \eta \, \partial E / \partial w_i
Same idea as single neuron
• Start with some initial weights w = [w0, w1, ...]
• Repeat until convergence:
– For each input x^j, use the current weights w to compute the output of the network f_w(x^j)
– Compute the loss E(w) between f_w(x^j) and y^j
– Compute the gradient of the loss with respect to each w_i
– Update each weight (a full training-loop sketch follows below):

w_i \leftarrow w_i - \eta \, \partial E / \partial w_i
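Putting the pieces together, a compact training loop for a single sigmoid neuron in Python; the data, learning rate, and epoch count are illustrative assumptions, and a multi-layer network would additionally need backpropagation to obtain the gradients of the hidden-layer weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, eta=1.0, epochs=2000):
    n = len(data[0][0])
    w = [0.0] * n                                                # initial weights
    for _ in range(epochs):                                      # repeat until (approximate) convergence
        grad = [0.0] * n
        for x, y in data:
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))    # forward pass f_w(x)
            for i, xi in enumerate(x):
                grad[i] += -(y - o) * o * (1.0 - o) * xi         # dE/dw_i
        w = [wi - eta * gi for wi, gi in zip(w, grad)]           # gradient step
    return w

# NAND-like data; the last input is the constant 1 (bias).
data = [([0, 0, 1], 1), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 0)]
w = train(data)
for x, y in data:
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    print(x, y, round(o, 2))    # outputs should end up close to the targets
```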
Intuition
• The math becomes complicated. I don't expect you to know the equations off the top of your head.
• But it’s important to understand the intuition.