
CSCI218: Foundations of Artificial Intelligence

Classical stats/ML: Minimize loss function
§ Which hypothesis space H to choose?
§ E.g., linear combinations of features: h_w(x) = wᵀx
§ How to measure degree of fit?
§ Loss function, e.g., squared error Σ_j (y_j − wᵀx_j)²
§ How to trade off degree of fit vs. complexity?
§ Regularization: complexity penalty, e.g., ‖w‖²
§ How do we find a good h?
§ Optimization (closed-form, numerical); discrete search
§ How do we know if a good h will predict well?
§ Try it and see (cross-validation, bootstrap, etc.)
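
A minimal sketch (not from the slides) tying this recipe together: a linear hypothesis h_w(x) = wᵀx, squared-error loss, an ‖w‖² complexity penalty, and a closed-form optimizer. The toy data and the regularization strength lam are made up for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam=0.1):
    """Minimize sum_j (y_j - w^T x_j)^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Made-up data: y is roughly a linear function of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)

w = fit_ridge(X, y)
print("learned weights:", w)   # should be close to [2, -1]
```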

Deep Learning/Neural Network

Image Classification
Very loose inspiration: Human neurons

[Figure: a biological neuron, showing dendrites, cell body (soma), nucleus, axon, axonal arborization, and synapses to other cells]

Simple model of a neuron (McCulloch & Pitts, 1943)
[Figure: unit j, with a fixed bias input a_0 = 1 (bias weight w_{0,j}), input links a_i with weights w_{i,j}, a summation producing in_j, an activation function g, the output a_j = g(in_j), and output links]

§ Inputs a_i come from the output of node i to this node j (or from “outside”)
§ Each input link has a weight w_{i,j}
§ There is an additional fixed input a_0 = 1 with bias weight w_{0,j}
§ The total input is in_j = Σ_i w_{i,j} a_i
§ The output is a_j = g(in_j) = g(Σ_i w_{i,j} a_i) = g(w · a)
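
A small sketch of a single unit following these definitions; the weights, inputs, and the two activation choices below are illustrative assumptions only.

```python
import numpy as np

def unit_output(a, w, g):
    """a_j = g(in_j), where in_j = sum_i w_{i,j} a_i and a_0 = 1 is the fixed bias input."""
    a = np.concatenate(([1.0], a))   # prepend the bias input a_0 = 1
    in_j = np.dot(w, a)              # total input in_j = w . a
    return g(in_j)

threshold = lambda x: 1.0 if x >= 0 else 0.0     # hard threshold activation
sigmoid   = lambda x: 1.0 / (1.0 + np.exp(-x))   # sigmoid activation

a = np.array([0.5, -1.0, 2.0])          # outputs of nodes feeding this unit (made up)
w = np.array([0.1, 0.8, 0.3, -0.5])     # w_{0,j} (bias weight) then w_{1,j}..w_{3,j}
print(unit_output(a, w, threshold), unit_output(a, w, sigmoid))
```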
Activation functions g
[Figure: (a) threshold activation; (b) sigmoid activation g(x) = 1/(1 + e^{−x})]
Reminder: Linear Classifiers

▪ Inputs are feature values


▪ Each feature has a weight
▪ Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)

▪ If the activation is:
    ▪ Positive, output +1
    ▪ Negative, output −1

[Figure: features f_1, f_2, f_3 weighted by w_1, w_2, w_3, summed (Σ), then thresholded (> 0?)]
How to get probabilistic decisions?

If the activation z = w · f(x) is very positive, want probability going to 1

If the activation is very negative, want probability going to 0

Sigmoid function: φ(z) = 1 / (1 + e^{−z})
Best w?
Maximum likelihood estimation:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

with:

  P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^{−w · f(x^(i))})
  P(y^(i) = −1 | x^(i); w) = 1 − 1 / (1 + e^{−w · f(x^(i))})

= Logistic Regression
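
A sketch of this objective in numpy (the feature matrix F and labels y ∈ {−1, +1} are assumed inputs; the gradient is included because the optimization slides below need it):

```python
import numpy as np

def log_likelihood(w, F, y):
    """ll(w) = sum_i log P(y_i | x_i; w), with P(+1 | x; w) = 1 / (1 + exp(-w . f(x)))."""
    z = F @ w                                  # activations w . f(x_i)
    return np.sum(np.log(1.0 / (1.0 + np.exp(-y * z))))

def grad_log_likelihood(w, F, y):
    """d ll / d w: each example contributes y_i * f(x_i) * (1 - P(y_i | x_i; w))."""
    z = F @ w
    p = 1.0 / (1.0 + np.exp(-y * z))           # probability assigned to the true label
    return F.T @ (y * (1.0 - p))
```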
Multiclass Logistic Regression
Multi-class linear classification

A weight vector for each class: w_y

Score (activation) of a class y: w_y · f(x)

Prediction w/highest score wins: y = argmax_y w_y · f(x)

How to make the scores into probabilities? Softmax:

  z_i → e^{z_i} / Σ_k e^{z_k}

[Figure: original activations z_1, z_2, z_3 → softmax activations (probabilities that sum to 1)]
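
A quick sketch of the softmax mapping just described (example activation values only):

```python
import numpy as np

def softmax(z):
    """Map activations z_1..z_K to probabilities e^{z_i} / sum_k e^{z_k}."""
    z = z - np.max(z)            # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, -1.0])))   # ~[0.71, 0.26, 0.04], sums to 1
```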


Best w?
Maximum likelihood estimation:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

with:

  P(y^(i) | x^(i); w) = e^{w_{y^(i)} · f(x^(i))} / Σ_y e^{w_y · f(x^(i))}

= Multi-Class Logistic Regression


Optimization

i.e., how do we solve:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
Hill Climbing
A simple, general idea
Start wherever
Repeat: move to the best neighboring state
If no neighbors better than current, quit

What’s particularly tricky when hill-climbing for multiclass logistic regression?
• Optimization over a continuous space
• Infinitely many neighbors!
• How to do this efficiently?
1-D Optimization

Could evaluate g(w_0 + h) and g(w_0 − h)

Then step in best direction

Or, evaluate derivative:

  ∂g(w_0)/∂w = lim_{h→0} [g(w_0 + h) − g(w_0 − h)] / (2h)

Tells which direction to step in
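
A quick numerical illustration of both options on a made-up 1-D objective g (the step size h is also an assumption):

```python
def g(w):                       # example 1-D objective (made up)
    return -(w - 3.0) ** 2

w0, h = 1.0, 1e-4
# Option 1: evaluate g(w0 + h) and g(w0 - h), then step toward the larger value.
step_right = g(w0 + h) > g(w0 - h)
# Option 2: estimate the derivative; its sign tells us which direction to step in.
dg = (g(w0 + h) - g(w0 - h)) / (2 * h)
print(step_right, dg)           # True, ~4.0 -> step in the positive direction
```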
2-D Optimization

Source: offconvex.org
Gradient Ascent
Perform update in uphill direction for each coordinate
The steeper the slope (i.e. the higher the derivative), the bigger the step for that coordinate

E.g., consider: g(w_1, w_2)

Updates:
  w_1 ← w_1 + α · ∂g/∂w_1(w_1, w_2)
  w_2 ← w_2 + α · ∂g/∂w_2(w_1, w_2)

Updates in vector notation:
  w ← w + α · ∇_w g(w)
  with ∇_w g(w) = gradient
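
A minimal sketch of these updates on a made-up two-variable objective, with an assumed learning rate alpha:

```python
import numpy as np

def grad_g(w):
    """Gradient of g(w1, w2) = -(w1 - 1)^2 - 2 * (w2 + 2)^2 (an example objective)."""
    return np.array([-2.0 * (w[0] - 1.0), -4.0 * (w[1] + 2.0)])

w = np.zeros(2)                        # start wherever
alpha = 0.1                            # learning rate
for _ in range(100):
    w = w + alpha * grad_g(w)          # w <- w + alpha * grad g(w)
print(w)                               # approaches the maximizer (1, -2)
```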
Steepest Descent
o Idea:
o Start somewhere
o Repeat: Take a step in the steepest descent direction

Figure source: Mathworks


Steepest Direction
o Steepest Direction = direction of the gradient

  ∇g = [ ∂g/∂w_1,  ∂g/∂w_2,  …,  ∂g/∂w_n ]ᵀ
Optimization Procedure: Gradient Ascent

init w
for iter = 1, 2, …
    w ← w + α · ∇_w g(w)

▪ α: learning rate --- hyperparameter that needs to be chosen carefully
Batch Gradient Ascent on the Log Likelihood Objective

init w
for iter = 1, 2, …
    w ← w + α · Σ_i ∇ log P(y^(i) | x^(i); w)
Stochastic Gradient Ascent on the Log Likelihood Objective

Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one

init w
for iter = 1, 2, …
    pick random j
    w ← w + α · ∇ log P(y^(j) | x^(j); w)
Mini-Batch Gradient Ascent on the Log Likelihood Objective

Observation: the gradient over a small set of training examples (= mini-batch) can be computed in parallel, might as well do that instead of a single example

init w
for iter = 1, 2, …
    pick random subset of training examples J
    w ← w + α · Σ_{j∈J} ∇ log P(y^(j) | x^(j); w)
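
Putting the three variants together, a sketch of mini-batch gradient ascent on the multi-class log-likelihood; the feature matrix F, integer labels y, batch size, learning rate, and iteration count are all assumed values.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def minibatch_gradient_ascent(F, y, num_classes, alpha=0.1, batch=32, iters=1000):
    """Maximize sum_i log P(y_i | x_i; W), with P a softmax over class scores W f(x)."""
    n, d = F.shape
    W = np.zeros((num_classes, d))                     # one weight vector per class
    rng = np.random.default_rng(0)
    for _ in range(iters):
        J = rng.choice(n, size=batch, replace=False)   # random subset of training examples
        P = softmax_rows(F[J] @ W.T)                   # predicted class probabilities
        Y = np.eye(num_classes)[y[J]]                  # one-hot encoding of the true labels
        grad = (Y - P).T @ F[J]                        # gradient of the log likelihood on the batch
        W = W + alpha * grad / batch                   # ascend
    return W
```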
Neural Networks
Multi-class Logistic Regression
= special case of neural network (single layer, no hidden layer)
[Figure: features f_1(x), f_2(x), f_3(x), …, f_K(x) feed class activations z_1, z_2, z_3; a softmax layer turns the z’s into class probabilities]
Multi-layer Perceptron

[Figure: inputs x_1, x_2, x_3, …, x_L pass through several hidden layers of units, each applying a nonlinear activation g, followed by a softmax output layer]

g = nonlinear activation function


Multi-layer Perceptron
Common Activation Functions

[source: MIT 6.S191 introtodeeplearning.com]
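
For reference, a sketch of a few widely used activation functions (the exact set shown in the cited figure may differ):

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))    # squashes to (0, 1)
def tanh(x):    return np.tanh(x)                  # squashes to (-1, 1)
def relu(x):    return np.maximum(0.0, x)          # 0 for negatives, identity otherwise

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```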


Multi-layer Perceptron
Training the MLP neural network is just like logistic regression:

just w tends to be a larger vector

just run gradient ascent ⇒ back-propagation algorithm
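
A minimal sketch of that claim: a one-hidden-layer network trained by gradient ascent on the log likelihood, with the gradient computed by back-propagation. All sizes, the made-up data, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # made-up inputs
y = (X[:, 0] * X[:, 1] > 0).astype(int)       # made-up binary labels
Y = np.eye(2)[y]                              # one-hot labels

H = 16                                        # hidden units
W1 = rng.normal(scale=0.5, size=(3, H))       # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(H, 2))       # hidden -> output weights
alpha = 0.5

for _ in range(500):
    # Forward pass: hidden layer with nonlinearity g (tanh here), then softmax output.
    A = np.tanh(X @ W1)
    Z = A @ W2
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    # Backward pass: back-propagate the log-likelihood gradient through the layers.
    dZ = (Y - P) / len(X)                     # gradient at the output scores
    dW2 = A.T @ dZ
    dA = dZ @ W2.T
    dW1 = X.T @ (dA * (1 - A ** 2))           # tanh'(u) = 1 - tanh(u)^2
    # Gradient ascent step on all weights.
    W1 += alpha * dW1
    W2 += alpha * dW2

print("training accuracy:", (P.argmax(axis=1) == y).mean())
```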


Neural Networks Properties
Theorem (Universal Function Approximators). A two-layer
neural network with a sufficient number of neurons can
approximate any continuous function to any desired accuracy.

Practical considerations
Can deal with more complex, nonlinear classification & regression
Large number of neurons and weights
Danger of overfitting
Deep Learning Model

Neural network as general computation graph

Krizhevsky, Sutskever, Hinton, 2012


Deep Learning Model
§ We need good features!

[Figure: classic pipeline: input image → Feature Extraction (using prior knowledge, experience) → Classification → “Panda”?]

Challenges: pose, occlusion, multiple objects, inter-class similarity

Image courtesy of M. Ranzato


Deep Learning Model

§ Directly learn feature representations from data.

§ Jointly learn the feature representation and the classifier.

[Figure: increasingly abstract representations: Low-level Features → Mid-level Features → High-level Features → Classifier → “Panda”?]

Deep Learning: train layers of features so that classifier works well.

Image courtesy of M. Ranzato


Deep Learning Model
Have we been here before?
➢ Yes.
  • Basic ideas common to past neural networks research
  • Standard machine learning strategies still relevant.
➢ No.
  • Today’s Deep Learning = Large-scale Data + Computational Power + New Algorithms
Deep Learning Model
Convolutional Neural Networks (CNNs)
§ A special multi-stage architecture inspired by visual system
§ Higher stages compute more global, more invariant features
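
To make “multi-stage” concrete, here is a sketch of the basic CNN building block, a 2-D convolution followed by a nonlinearity, in plain numpy; the image and filter values are made up.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (technically cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(8, 8))       # made-up single-channel image
edge_filter = np.array([[1.0, -1.0]])                      # responds to horizontal intensity changes
feature_map = np.maximum(0.0, conv2d(image, edge_filter))  # convolution + ReLU nonlinearity
print(feature_map.shape)                                   # (8, 7)
```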
Deep Learning Model

[Figure: LeNet-5, a classic CNN architecture. Source: https://www.datasciencecentral.com/lenet-5-a-classic-cnn-architecture/]
Different Neural Network Architectures
§ Exploration of different neural network architectures
§ ResNet: residual networks
§ Networks with attention
§ Transformer networks
§ Neural network architecture search
§ Really large models
§ GPT2, GPT3
§ CLIP

Acknowledgement

The lecture slides are based on materials from ai.berkeley.edu


Thank you. Questions?
