Lec2 Perceptron MLP


FALL 2023
INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Tianyi Zhou, University of Maryland
8/28/23
Some slides adapted from Song & Russell
Seven Components of this course
[Diagram: an Agent interacts with the World and with Human users, taking in Reward, Data, and Observations and producing Actions and Predictions. The course components shown: Neural Networks, Search & Planning, Perception, Acting, Language Models, Embodied & Multi-modal AI, Probabilistic Reasoning.]
Neural networks
• Perceptron and MLP
• Optimization and Backpropagation
• Convolutional Neural Networks
• Recurrent Neural Networks and LSTM
• Attention and Transformer
• Graph Neural Networks

Search & Planning
• Uninformed Search
• Informed Search and A*
• Constraint Satisfaction
• Adversarial Search and Games
• Planning

Perception
• Detection
• Segmentation
• Captioning and VQA
• 3D Vision
• Video Understanding
Acting
• Markov Decision Process (MDP)
• Reinforcement Learning (RL)
• Imitation Learning
Language models
• Pre-training
• Finetuning
• Prompting
• Ensemble
• Understanding and Reasoning
Embodied & Multi-modal AI
• Vision-Language Models
• LLM Agent
• Embodied Agent
• VLM/LLM + RL
Probabilistic Reasoning & Bayesian Networks
• Bayesian Networks
• Probabilistic Inference
• Sampling
Course websites
• ELMS: https://umd.instructure.com/courses/1350035
– Announcements, lecture notes/recordings, syllabus, class schedule, final grades

• Piazza: https://piazza.com/class/llup7xc8lm44w4
– Any and all offline questions
– Use Piazza before emailing
– Option to send a private message (also available in ELMS)

• Gradescope: https://www.gradescope.com/courses/590315 (entry code: DP2ED3)


– Assignment submission and grading

• Zoom
– Final presentation and some office hours
Course materials
• Textbooks
– Russell & Norvig: https://aima.cs.berkeley.edu/ (primary)
– Sutton & Barto: http://www.incompleteideas.net/book/the-book-2nd.html
– Goodfellow, Bengio, & Courville: https://www.deeplearningbook.org/
– Murphy: https://probml.github.io/pml-book/
• Other AI Courses
– Berkeley CS188: https://inst.eecs.berkeley.edu/~cs188/archives
– Stanford CS221: https://stanford-cs221.github.io/spring2023/
– MIT Deep Learning: https://introtodeeplearning.com/
– Harvard CS50: https://cs50.harvard.edu/ai/2023/

• Online Materials:
– https://github.com/owainlewis/awesome-artificial-intelligence
Coursework and grading
• Four Assignments (no collaboration allowed, 2-3 weeks per assignment) (60%)
– Assignment-1: Neural Networks and Optimization
– Assignment-2: Search and Planning Algorithms
– Assignment-3: MDP, RL, and Imitation Learning
– Assignment-4: Perception and Language Modeling
• One Group Project (groups of 5 or fewer; members may receive different credit) (60%)
– Mid-term Presentation (problem introduction + preliminary results + paper presentation) (15%)
– Weekly Report (10 reports from the 3rd week, excluding mid-term and final weeks) (20%)
– Final Technical Report (25%)
• One in-class closed-book Exam (Oct 30th) (20%)
– Topics mainly included in the Four Assignments
• Attendance (in-class signature) (20%)
• Final grade = max {Assignments, Project} + Exam + Attendance
Late/absence & re-grading policies
• If you submit by the due date – no penalty
• Up to one hour late – 5% penalty
• Up to 24 hours late – 20% penalty
• 24-48 hours late – 50% penalty
• Beyond 48 hours – 100% penalty

• If you must miss a class, please notify me via email 24 hours ahead of the class.
• If you need to rearrange an exam or presentation for medical or health reasons, please notify me via email 24 hours ahead of the exam/presentation (except in emergencies).

• Re-grading must be initiated within one week of receiving grades on Gradescope.
• You can initiate re-grading by contacting the TAs.
• No grade changes will be made for requests initiated more than one week after grades are released.
University policies
• https://www.ugst.umd.edu/courserelatedpolicies.html
• Never, never, never plagiarize.
Special accommodations
• If you feel that you may need an accommodation because of a disability (learning disability, attention deficit disorder, psychological, physical, etc.), please make an appointment to see me during office hours.
TODOs
• Vote in the TA office hour poll.
• Read the course materials for the next few lectures regarding neural networks.
• Talk to your classmates and find a group with common interests for the project (we will announce the project topics soon).
• Learn and practice Python and PyTorch.
• Review the calculus and algorithms you have learned before.
Welcome on board!
Neural networks
• Perceptron and MLP
• Optimization and Backpropagation
• Convolutional Neural Networks
• Recurrent Neural Networks and LSTM
• Attention and Transformer
• Graph Neural Networks

Plan
• Today
– Neural networks basics
• http://neuralnetworksanddeeplearning.com/
– Ch 1, 2, and 6
• Russell and Norvig
– Ch 18.7
– https://jeremykun.com/2012/12/09/neural-networks-and-backpropagation/

• Next couple of lectures
– Neural Networks
Biological neurons
• Each nerve cell (neuron) is a processing unit of input signals.
• If the incoming signal impulse is strong enough, the neuron is activated and sends signals to the next neuron via neurotransmitters.
• The next neuron receives the signal and continues the process.
Artificial neurons
• Artificial neurons in a neural network imitate biological neurons.
• A neuron receives n inputs from previous neurons.
• Each neuron is associated with n weights and computes a weighted sum of the n inputs, $\sum_i w_i x_i$.
• An activation function $\sigma(\cdot)$ determines the output $\sigma(\sum_i w_i x_i)$ based on the strength of the weighted sum.
• With a bias term b:
  $y = b + \sum_i w_i x_i$
• How to determine the weights?
• How to determine the activation function?
Perceptron (Rosenblatt, 1958)
[Diagram: binary inputs x1, x2, x3 with weights w1, w2, w3 feed a single unit that produces one output.]
• A perceptron computes the weighted sum of several binary inputs x1, x2, … and produces a single binary output.
• Use sign() as the activation function.
• For example,
  $\text{output} = \begin{cases} 0 & \text{if } \sum_i w_i x_i \le \text{threshold} \\ 1 & \text{if } \sum_i w_i x_i > \text{threshold} \end{cases}$
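As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this thresholded weighted sum; the inputs, weights, and threshold below are arbitrary placeholders.

```python
import numpy as np

def perceptron(x, w, threshold):
    """Binary output: 1 if the weighted sum exceeds the threshold, else 0."""
    return 1 if np.dot(w, x) > threshold else 0

# Example with placeholder values (not from the slides).
x = np.array([1, 0, 1])         # three binary inputs x1, x2, x3
w = np.array([0.5, -0.4, 0.9])  # weights w1, w2, w3
print(perceptron(x, w, threshold=1.0))  # prints 1, since 0.5 + 0.9 = 1.4 > 1.0
```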
Perceptron for classification
[Diagram: a constant input x0 = 1 with weight b (aka bias, aka threshold), plus inputs x1, x2, x3 with weights w1, w2, w3, feeding the output.]
$\text{output} = \begin{cases} 0 & \text{if } \sum_i w_i x_i + b \le 0 \\ 1 & \text{if } \sum_i w_i x_i + b > 0 \end{cases}$
Perceptron as a Threshold function
[Diagram: inputs 1, x1, x2 with weights w0, w1, w2 feed a summation unit followed by a threshold, producing f(x).]
1. Find the weighted sum of the inputs:
  $z(x) = \sum_i w_i x_i$
2. Apply thresholding:
  $th(z) = \begin{cases} 0 & \text{if } z \le 0 \\ 1 & \text{if } z > 0 \end{cases}$
The output is a composition of the two functions: $f(x) = th(z(x))$.
Exercise: Design a perceptron that functions as a NAND gate
[Diagram: inputs x0 = 1 (weight b), x1 (weight w1), x2 (weight w2) feed a perceptron producing the output.]
x1  x2  output
0   0   1
0   1   1
1   0   1
1   1   0
Exercise
[Same perceptron: x0 = 1 with weight b, x1 with weight w1, x2 with weight w2.]
x1  x2  output
0   0   1
0   1   1
1   0   1
1   1   0
• Design a perceptron that functions as a NAND gate → what should w1, w2, b be?
$\text{output} = \begin{cases} 0 & \text{if } w_1 x_1 + w_2 x_2 + b \le 0 \\ 1 & \text{if } w_1 x_1 + w_2 x_2 + b > 0 \end{cases}$
w1 = ?   w2 = ?   b = ?
Perceptron and logical functions
[Diagram: x0 = 1 with weight 3 (the bias), x1 with weight -2, x2 with weight -2, feeding the output.]
• Output = 0 only when both inputs are 1; otherwise, output = 1.
• The NAND gate is a universal gate.
• Perceptrons can therefore model any logical gate.
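To check the NAND solution from the slide (w1 = w2 = -2, b = 3), a quick sketch that evaluates the perceptron on all four input pairs:

```python
def perceptron(x1, x2, w1=-2, w2=-2, b=3):
    # Output 1 if w1*x1 + w2*x2 + b > 0, else 0 (the NAND weights from the slide).
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, perceptron(x1, x2))
# Prints 1, 1, 1, 0 -- matching the NAND truth table.
```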
Designing (training) a neuron
• Just like the perceptron case, we will assume that we know the inputs and desired outputs.
• Our goal will be to find the weights so that the output of the neuron matches the desired outputs.
How to Train a Perceptron?
• (1) If the output is < 0 but y = P (positive), add x to w; (2) if the output is >= 0 but y = N (negative), subtract x from w.
• Simple rule: w = w + y·x (see the sketch below).
• The number of errors made by the Perceptron algorithm is upper bounded:
  http://www.cs.columbia.edu/~mcollins/courses/6998-2012/notes/perc.converge.pdf
  https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html
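A minimal sketch of the mistake-driven update w ← w + y·x, assuming labels y ∈ {−1, +1} and a constant-1 feature so the bias is folded into w; the toy NAND data below is illustrative, not taken from the slides.

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Perceptron algorithm: on each mistake, update w <- w + y * x."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):           # yi in {-1, +1}
            if yi * np.dot(w, xi) <= 0:    # misclassified (or on the boundary)
                w += yi * xi
    return w

# Toy linearly separable data; a constant 1 is appended so w[-1] acts as the bias b.
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([+1, +1, +1, -1])             # NAND with +1/-1 labels
w = train_perceptron(X, y)
print(w, np.sign(X @ w))                   # learned weights; predictions match y
```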
Beyond Logical Functions
[Diagram: inputs x0, x1 with weights w0, w1 feeding a single output.]
• However, we are interested in non-binary inputs and non-binary outputs, such as pixel values.
• Perceptron – binary inputs and binary outputs.
• Neurons – binary and/or real-valued inputs and binary and/or real-valued outputs.
Perceptron as a Threshold function
[Diagram: inputs 1, x1, x2 with weights w0, w1, w2 feed a summation unit followed by a threshold, producing f(x).]
1. Find the weighted sum of the inputs:
  $z(x) = \sum_i w_i x_i$
2. Apply the threshold function:
  $th(z) = \begin{cases} 0 & \text{if } z \le 0 \\ 1 & \text{if } z > 0 \end{cases}$
The output is a composition of the two functions: $f(x) = th(z(x))$.
Sigmoid Neuron
[Diagram: inputs 1, x1, x2 with weights w0, w1, w2 feed a summation unit followed by a sigmoid, producing f(x).]
1. Find the weighted sum of the inputs:
  $z(x) = \sum_i w_i x_i$
2. Apply the sigmoid function:
  $\sigma(z) = \frac{1}{1 + e^{-z}}$
The output is a composition of the two functions: $f(x) = \sigma(z(x))$.
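A minimal NumPy sketch of the sigmoid neuron's forward computation; the input and weight values are arbitrary placeholders, and the leading 1 in x supplies the bias weight w0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w):
    """f(x) = sigma(z(x)), where z(x) = sum_i w_i * x_i (x[0] = 1 pairs with the bias w0)."""
    return sigmoid(np.dot(w, x))

x = np.array([1.0, 0.5, -1.2])   # [1, x1, x2]
w = np.array([0.1, 0.8, 0.3])    # [w0, w1, w2], placeholder values
print(sigmoid_neuron(x, w))      # a real value in (0, 1) instead of a hard 0/1
```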
Activation functions
• Thresholding and filtering function
• Increase nonlinearity and expressiveness
• Partition the input space
• Many choices are available, and the choice is critical to performance
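For reference, a NumPy sketch of a few common choices (sigmoid, tanh, ReLU); the slide does not prescribe a specific one, so these are just standard examples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)        # piecewise linear: 0 for z <= 0, z otherwise

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```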
Neural connections in animal brains
• The connectome (a map of neural connections) of the <1 mm worm Caenorhabditis elegans contains 302 neurons.
• Human brains are much more complicated.
• The architecture of neural networks matters (multi-layer, convolution, attention, recurrent, graph).
Multi-layer perceptron (MLP)
[Diagram: inputs with weights w1-w4 feed one or more hidden layers, whose outputs (weights w13, w14, ...) feed the output.]
• A single neuron cannot model interesting functions
• But a network of neurons can… model any* function
• Finding weights by trial and error won't work
• We will study a technique to learn the weights
Linear Regions of MLP
• A neural network recursively applies a nonlinear activation $\sigma$ along a pathway (a sequence of neurons), i.e.,
  $f(x) = f_L(\sigma[f_{L-1}(\sigma[f_{L-2}(\cdots f_2(\sigma(f_1(x))))])])$.
• If $\sigma$ is piecewise linear (e.g., ReLU), the neural network partitions the input space into an exponential number of polytopes and applies a linear function to the data in each polytope.
Neural Networks are Piecewise-Linear
• An exponential number of polytopes, but each polytope contains at most one data point.
• Where does generalization come from? Shared facets/edges between polytopes.
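A small illustrative sketch (not from the slides) of this partitioning: two inputs that produce the same on/off pattern of hidden ReLUs lie in the same polytope, and within that region the network acts as a single affine function. The network weights and test points are arbitrary.

```python
import numpy as np

# A tiny 2-layer ReLU network with random weights (sizes are arbitrary).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)   # 8 ReLU hidden units on 2-D input
w2, b2 = rng.normal(size=8), rng.normal()               # linear output layer

def activation_pattern(x):
    # The on/off pattern of the hidden ReLUs identifies the polytope x lies in.
    return tuple((W1 @ x + b1) > 0)

def forward(x):
    return w2 @ np.maximum(0.0, W1 @ x + b1) + b2        # piecewise-linear in x

x, x_nearby = np.array([0.3, -0.1]), np.array([0.31, -0.1])
same_region = activation_pattern(x) == activation_pattern(x_nearby)
print(same_region, forward(x), forward(x_nearby))
# If the patterns match, the network is the same affine function on both points.
```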
Neural Networks are Piecewise-Linear

Training a Piecewise-Linear Neural Network
[Figure: two ridges combine into a bump; inputs feed the hidden units and their outputs sum to the network output.]
Multi-layer perceptron
[Diagram: input layer (activation function = identity), one or more hidden layers, output layer.]
• Directed graph of neurons
• Each neuron has some activation function
• The activations from the previous layer are passed as inputs to the next layer
Multi-layer perceptron
[Diagram: input layer (activation function = identity), one or more hidden layers, output layer.]
• The output of each neuron is the weighted sum of all its inputs passed through an activation function
• The final output is the result of many, many compositions of functions
• http://playground.tensorflow.org/
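A minimal PyTorch sketch of such an MLP, similar in spirit to what the playground link lets you build interactively; the layer sizes here are arbitrary choices, not prescribed by the slides.

```python
import torch
import torch.nn as nn

# A small MLP: identity on the inputs, two hidden layers with nonlinear
# activations, and a single output.
mlp = nn.Sequential(
    nn.Linear(2, 8),   # input layer -> first hidden layer
    nn.ReLU(),
    nn.Linear(8, 8),   # second hidden layer
    nn.ReLU(),
    nn.Linear(8, 1),   # output layer
)

x = torch.randn(5, 2)   # a batch of 5 two-dimensional inputs
print(mlp(x).shape)     # torch.Size([5, 1])
```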
Neural networks are function approximators
• Use (x, y) data to learn a neural network f(x) such that y ≈ f(x)

Prediction/Regression: the output is a continuous variable
  f([is weekend?, is holiday?, weather, area income, …]) → sales forecast
Classification: the output is a discrete label
  f([is weekend?, is holiday?, weather, area income, …]) → is cat?

In RL, we will use a NN for regression: f(x) ← Q([s, a])
Supervised Learning
[Diagram: input layer (activation function = identity) with weights w1-w4, one or more hidden layers with weights w13, w14, ..., and an output layer (which can have multiple outputs).]
Training
Given training inputs and outputs, find weights so that the network output matches the training outputs. How?

Training inputs            output
x1     x2     x3           y
0.6    0      -1           10
0.4    1.2    3.1          21
1.7    0.6    -0.3         2.1
1      1      0            0.1
What should w1, w2, b be?
[Diagram: the perceptron again: x0 = 1 with weight b, x1 with weight w1, x2 with weight w2, producing f(x).]
• We want f(x) = y; how do we measure their difference?
• Minimize a loss function $E(w) = \sum_{x,y} \|y - f(x)\|^2$
$f(x) = \begin{cases} 0 & \text{if } w_1 x_1 + w_2 x_2 + b \le 0 \\ 1 & \text{if } w_1 x_1 + w_2 x_2 + b > 0 \end{cases}$
x1  x2  y
0   0   1
0   1   1
1   0   1
1   1   0
• You likely did the following:
  – started with some values of w1, w2, b, say 1, 1, 0
  – computed f(x) using w1 = 1, w2 = 1, b = 0
  – checked whether f(x) matched y, i.e., whether the loss E(w) is small
  – if not, adjusted w1, w2, b slightly
  – rinse and repeat until f(x) matched y
Optimization of the loss function
[Figure: the loss surface E(w) plotted over weights w1 and w2.]

Different Optimization Algorithms
The loss landscape of a DNN is complicated.
https://tinyurl.com/2fe26wsn
General Recipe for optimization
• Start with some initial weights w
• Compute f(x) based on these weights
• Check if f(x) matches y – how? Compute E(w)
• Adjust the weights – how?
• Rinse and repeat

Training inputs            Training output
x1     x2     x3           y
0.6    0      -1           10
0.4    1.2    3.1          21
1.7    0.6    -0.3         2.1
1      1      0            0.1
[Diagram: the MLP with weights w1-w4, w13, w14.]
How to adjust weights?
• Let $w_{old} = [w_1, w_2, w_3, \ldots]^T$ be the current set of weights
• Find $\nabla E(w_{old}) = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots \right)$
• $-\nabla E$ is a vector pointing in the direction of steepest decrease of E at $w_{old}$, i.e., towards a local minimum
• Update the weights so that you move in that direction (towards the minimum):
  $w_{new} = w_{old} - \alpha \nabla E(w_{old})$
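A minimal sketch of this update rule, $w_{new} = w_{old} - \alpha \nabla E(w_{old})$, on a simple quadratic loss whose gradient is known in closed form; the loss, minimizer, and step size are illustrative placeholders.

```python
import numpy as np

target = np.array([3.0, -1.0])        # placeholder: the loss minimum sits here

def E(w):
    return np.sum((w - target) ** 2)  # toy loss E(w)

def grad_E(w):
    return 2.0 * (w - target)         # its gradient

w = np.zeros(2)                       # w_old
alpha = 0.1                           # step size
for step in range(100):
    w = w - alpha * grad_E(w)         # w_new = w_old - alpha * grad E(w_old)
print(w, E(w))                        # w approaches (3, -1), E approaches 0
```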
General Recipe
• Start with some initial weights
• Compute f(x) based on these weights
• Check if f(x) matches y – how? Compute E(w)
• Adjust the weights – how? Gradient descent
• Rinse and repeat

Training inputs            Training output
x1     x2     x3           y
0.6    0      -1           10
0.4    1.2    3.1          21
1.7    0.6    -0.3         2.1
1      1      0            0.1
[Diagram: the MLP with weights w1-w4, w13, w14.]
Gradient Descent
[Figure: descent steps on the loss surface over w1, w2.]
• We still need to compute
  $\nabla E(w_{old}) = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots \right)$
• I'll show how to do this for a single sigmoid neuron
• Then, we will see how to do this for a neural network – backpropagation
• Most libraries implement automatic differentiation (see the sketch below)
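As an example of that automatic differentiation, a PyTorch sketch that computes $\nabla E$ for a single sigmoid neuron without deriving the gradient by hand; the toy data (the NAND table with a leading 1 for the bias) and zero initialization are placeholders.

```python
import torch

# Toy data: 4 inputs (leading 1 pairs with the bias weight w0) and targets.
X = torch.tensor([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
y = torch.tensor([1., 1., 1., 0.])

w = torch.zeros(3, requires_grad=True)   # [w0, w1, w2]
f = torch.sigmoid(X @ w)                 # f_w(x^j) for every example
E = 0.5 * torch.sum((y - f) ** 2)        # E(w) = 1/2 * sum_j ||y^j - f_w(x^j)||^2
E.backward()                             # autodiff fills in w.grad = grad E(w)
print(w.grad)                            # approx. tensor([-0.2500, 0.0000, 0.0000]) here
```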
Sigmoid Neuron
[Diagram: inputs 1, x1, x2 with weights w0, w1, w2 feed the sigmoid neuron, producing the output f_w(x).]
$f_w(x) = \sigma(z(x))$,  $\sigma(z) = \frac{1}{1 + e^{-z}}$,  $z(x) = \sum_i w_i x_i$
$E(w) = \frac{1}{2} \sum_j \| y^j - f_w(x^j) \|^2$
We need to compute
$\nabla E(w_{old}) = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots \right)$
Finding the weights
$f_w(x) = \sigma(z(x))$,  $\sigma(z) = \frac{1}{1 + e^{-z}}$,  $z(x) = \sum_i w_i x_i$,  $E(w) = \frac{1}{2} \sum_j \| y^j - f_w(x^j) \|^2$
$\frac{\partial E}{\partial w_i} = \, ?$
Finding the weights
(Same definitions of $f_w$, $\sigma$, $z$, and $E$ as above.)
Use the chain rule:
$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial \sigma} \cdot \frac{\partial \sigma}{\partial z} \cdot \frac{\partial z}{\partial w_i}$
Finding the weights
(Same definitions as above.)
Use the chain rule:
$\frac{\partial E}{\partial w_i} = -\sum_j \left( y^j - \sigma(z(x^j)) \right) \cdot \frac{\partial \sigma}{\partial z} \cdot \frac{\partial z}{\partial w_i}$
with
$\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$  and  $\frac{\partial z}{\partial w_i} = x_i^j$,
where $x_i^j$ is the i-th component of the vector $x^j$.
Finding the weights
(Same definitions as above.)
$\frac{\partial E}{\partial w_i} = -\sum_j \left( y^j - \sigma(z(x^j)) \right) \cdot \frac{\partial \sigma}{\partial z} \cdot \frac{\partial z}{\partial w_i}$
The three occurrences of $\sigma(z(x^j))$ – in the error term and in $\sigma(z)(1 - \sigma(z))$ – are all the same quantity: the output of the neuron, $f_w(x^j)$.
$\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$,  $\frac{\partial z}{\partial w_i} = x_i^j$,
where $x_i^j$ is the i-th component of the vector $x^j$.
Rewriting the equations
$\frac{\partial E}{\partial w_i} = -\sum_j \left( y^j - \sigma(z(x^j)) \right) \cdot \frac{d\sigma}{dz} \cdot \frac{\partial z}{\partial w_i}$,  with  $\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$,  $\frac{\partial z}{\partial w_i} = x_i^j$
$\nabla E(w) = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots \right)$
Let $o^j \triangleq \sigma(z(x^j))$ be the output of the neuron for input $x^j$. Then
$\frac{\partial E}{\partial w_i} = -\sum_j (y^j - o^j)\, o^j (1 - o^j)\, x_i^j$
Update rule:  $w_i' = w_i - \eta \frac{\partial E}{\partial w_i}$
Gradient Descent
The change in $w_i$ is proportional to the gradient of the loss with respect to $w_i$:
$\nabla E(w) = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots \right)$
Let $o^j \triangleq \sigma(z(x^j))$ be the output of the neuron for input $x^j$.
$\frac{\partial E}{\partial w_i} = -\sum_j (y^j - o^j)\, o^j (1 - o^j)\, x_i^j$
Update rule:  $w_i' = w_i - \eta \frac{\partial E}{\partial w_i}$
A more specific recipe
Let $o^j \triangleq \sigma(z(x^j))$.
• Start with some initial weights w = [w0, w1, …]
• Repeat until convergence
  – For each input $x^j$, use the current weights w to compute the output of the neuron $f_w(x^j)$ (call it $o^j$)
  – Compute the loss E(w) between $f_w(x^j)$ and $y^j$
  – Compute the gradient of the loss with respect to each $w_i$:
    $\frac{\partial E}{\partial w_i} = -\sum_j (y^j - o^j)\, o^j (1 - o^j)\, x_i^j$
  – Update each weight $w_i$ based on the gradient:
    $w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$
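A NumPy sketch of this recipe for a single sigmoid neuron, using the gradient derived above; the toy data reuses the NAND truth table (with a leading 1 for the bias), and the learning rate and epoch count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: NAND truth table, with a leading 1 so w[0] acts as the bias.
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
y = np.array([1., 1., 1., 0.])

w = np.zeros(3)                           # initial weights [w0, w1, w2]
eta = 0.5                                 # learning rate
for epoch in range(10000):                # repeat until (approximate) convergence
    o = sigmoid(X @ w)                    # o^j = f_w(x^j) for every training input
    grad = -((y - o) * o * (1 - o)) @ X   # dE/dw_i = -sum_j (y^j - o^j) o^j (1 - o^j) x_i^j
    w = w - eta * grad                    # w_i <- w_i - eta * dE/dw_i
print(sigmoid(X @ w))                     # the four outputs move toward the targets [1, 1, 1, 0]
```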
Neural Network – MLP
[Diagram: input layer (activation function = identity), hidden layer, output layer.]
• Directed graph of neurons
• Each edge corresponds to a unique weight
• We need to learn all the weights
Training one neuron
• Start with some initial weights w = [w0, w1, …]
• Repeat until convergence
  – For each input $x^j$, use the current weights w to compute the output of the neuron $f_w(x^j)$ (call it $o^j$)
  – Compute the loss E(w) between $f_w(x^j)$ and $y^j$
  – Compute the gradient of the loss with respect to each $w_i$:
    $\frac{\partial E}{\partial w_i} = -\sum_j (y^j - o^j)\, o^j (1 - o^j)\, x_i^j$
  – Update each weight $w_i$ based on the gradient:
    $w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$
Same idea as a single neuron
• Start with some initial weights w = [w0, w1, …]
• Repeat until convergence
  – For each input $x^j$, use the current weights w to compute the output of the network $f_w(x^j)$
  – Compute the loss E(w) between $f_w(x^j)$ and $y^j$
  – Compute the gradient of the loss with respect to each $w_i$
    • a slightly more complicated equation (backpropagation)
  – Update each weight $w_i$ based on the gradient:
    $w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$
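A PyTorch sketch of this same loop for a small MLP, with the gradient computation delegated to autograd (i.e., backpropagation handled by the library); the data is the toy regression table from the earlier slides, while the architecture, learning rate, and epoch count are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data from the earlier "Training" table.
X = torch.tensor([[0.6, 0.0, -1.0], [0.4, 1.2, 3.1], [1.7, 0.6, -0.3], [1.0, 1.0, 0.0]])
y = torch.tensor([[10.0], [21.0], [2.1], [0.1]])

net = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.01)

for epoch in range(5000):                 # repeat until (approximate) convergence
    pred = net(X)                         # compute f_w(x^j) with the current weights
    loss = 0.5 * ((y - pred) ** 2).sum()  # E(w)
    opt.zero_grad()
    loss.backward()                       # backpropagation computes dE/dw for every weight
    opt.step()                            # w <- w - eta * dE/dw
print(net(X).detach().squeeze())          # predictions move toward the targets [10, 21, 2.1, 0.1]
```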
Intuition
• The math becomes complicated. I don't expect you to know the equations off the top of your head.
• But it's important to understand the intuition.
• Lots of resources online; here's a good one: https://www.youtube.com/watch?v=Ilg3gGewQ5U


Designing a NN
• Remember, the only unknowns are the weights
• Everything else should be decided by you

• How many layers?


• How many neurons in each layer?
• How to connect the neurons?
• What activation function to use?
• What loss function to use?
• What gradient descent method to use?
