
INFO 557 FA24 002 - Neural Networks
Instructor: Dr. Liang Zhang
TAs: Jiacheng Zhang, Ruoyao Wang
College of Information Science
University of Arizona
Quiz at Tophat (Join: 436056)
Use the web app or mobile app to answer:

With complete certainty, we know that initial parameters in a neural network must:

A break symmetry between different units

B set bias parameters to constants

C be close to the origin

D be random orthogonal matrices

With your group, come to consensus on the correct answer, and discuss what is right or wrong about each of the answers.


Parameter initialization is important
In neural network optimization, initial points determine:

● whether learning converges at all
● how quickly learning converges
● whether it converges to a point with high or low cost
● whether the point that it converges to has high or low generalization error
Initial parameters must break symmetry
If two units have the same initial parameters, they will get updated in the same
way.

Having two different units model exactly the same function is probably not useful.

Most common solution: random parameter initialization (a minimal sketch follows below)

● biases typically not randomly initialized
● weights drawn from a Gaussian or uniform distribution
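
As a concrete illustration of this recipe (not from the slides), here is a minimal NumPy sketch: weights drawn from a zero-mean Gaussian and biases set to a constant zero. The layer sizes and the 0.01 scale are arbitrary placeholder choices.

import numpy as np

rng = np.random.default_rng(0)

def init_layer(m, n, scale=0.01):
    # One fully connected layer with m inputs and n outputs:
    # random weights break symmetry between units; biases are not randomized.
    W = rng.normal(loc=0.0, scale=scale, size=(m, n))
    b = np.zeros(n)
    return W, b

W1, b1 = init_layer(784, 256)
W2, b2 = init_layer(256, 10)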
Size of weights is important
Larger initial weights can result in:

+ better symmetry-breaking

+ less signal lost during forward/back-propagation

- exploding values during forward/back-propagation

- extreme sensitivity to small perturbations of the input

- extreme values where the activation function saturates

- parameters further from origin (i.e., less regularized)


Random initialization heuristics

For a layer with m inputs and n outputs (see the code sketch below):

● W_ij ~ U(-1/√m, 1/√m)
  common default

● W_ij ~ U(-√(6/(m+n)), √(6/(m+n)))
  goal: same activation/gradient variance (Glorot/Xavier initialization)

● W, _, _ = SVD(random matrix)
  goal: all units are orthogonal


Random initialization heuristics

● W = g Q, where Q is a random orthogonal matrix and g is a carefully chosen gain factor
  goal: across-all-layer gradient magnitude ratio = 1

● sparse initialization: each unit has exactly k nonzero weights
  goal: total magnitude small; individual weights large
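
The following NumPy sketch illustrates these heuristics as reconstructed above (uniform default, Glorot-style uniform, SVD-based orthogonal, and sparse with k nonzeros per unit); the layer sizes, gain g, and k are placeholder values, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
m, n = 256, 128  # layer with m inputs and n outputs (placeholder sizes)

# Common default: W_ij ~ U(-1/sqrt(m), 1/sqrt(m))
W_default = rng.uniform(-1/np.sqrt(m), 1/np.sqrt(m), size=(m, n))

# Glorot/Xavier uniform: W_ij ~ U(-sqrt(6/(m+n)), sqrt(6/(m+n)))
limit = np.sqrt(6.0 / (m + n))
W_glorot = rng.uniform(-limit, limit, size=(m, n))

# Orthogonal: take a factor from the SVD of a random Gaussian matrix,
# optionally scaled by a gain g.
A = rng.normal(size=(m, n))
U, _, Vt = np.linalg.svd(A, full_matrices=False)
g = 1.0
W_orth = g * (U if U.shape == (m, n) else Vt)

# Sparse initialization: exactly k nonzero weights per output unit, so
# individual weights can stay large while total magnitude stays small.
k = 15
W_sparse = np.zeros((m, n))
for j in range(n):
    idx = rng.choice(m, size=k, replace=False)
    W_sparse[idx, j] = rng.normal(scale=1.0, size=k)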


Debugging initialization problems
Take a minibatch, then plot the standard deviation of the activations or gradients through the layers of the network.

Example: if the plot shows the activation scale collapsing at layer 3, consider increasing the weights of layer 3.
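
A possible way to produce this kind of diagnostic, assuming the activations have already been collected as NumPy arrays, one per layer, for a single minibatch; all names here are illustrative.

import numpy as np
import matplotlib.pyplot as plt

def plot_activation_std(activations):
    # activations: list of arrays, one per layer, each of shape (batch, units)
    stds = [float(np.std(a)) for a in activations]
    plt.plot(range(1, len(stds) + 1), stds, marker="o")
    plt.xlabel("layer")
    plt.ylabel("std of activations")
    plt.title("Activation scale through the network (one minibatch)")
    plt.show()

# Fake example: a layer whose standard deviation collapses (layer 3 here)
# suggests its initial weights are too small.
fake = [np.random.randn(32, 100) * s for s in (1.0, 0.8, 0.05, 0.6)]
plot_activation_std(fake)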


Initializing biases
In most cases, initialize b_i = 0

For output units, initialize to the marginal statistics of the targets, e.g.,

● linear unit: b = mean of the training targets y
● softmax unit: b = softmax⁻¹(c), where c_i = P(y = i)

For ReLU, initialize b_i = 0.1 to avoid saturation (units start in the active region)

For gate units, initialize b_i = 1, i.e., open gate

(A code sketch of these rules follows below.)
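
A small sketch of these bias rules in NumPy, assuming a classification problem where class_counts holds the training-set frequency of each class; the 0.1 and 1.0 constants follow the slide, everything else (sizes, names) is illustrative.

import numpy as np

# Softmax output bias: b = softmax^{-1}(c), where c_i = P(y = i).
# log(c) is a valid inverse up to an additive constant, which softmax ignores.
class_counts = np.array([900, 90, 10])        # hypothetical class frequencies
c = class_counts / class_counts.sum()
b_softmax = np.log(c)

# Hidden ReLU units: small positive bias so units start in the active region.
b_relu = np.full(256, 0.1)

# Gate units (e.g., LSTM forget gates): bias of 1 so the gate starts open.
b_gate = np.ones(128)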


Pre-training models
Weights can be initialized via training a simpler model

● Initialize a supervised model by training an unsupervised model on the same inputs
● Initialize a supervised model by training another supervised model on a related task
Group Activity
You plan to train a neural network to segment images into regions corresponding
to sky, trees, vehicles, etc. However, you have only a few thousand images
manually annotated for such segments.

ImageNet provides over 14 million images, tagged (but not segmented) for more
than 20 thousand word senses: 1633 images tagged with canine, 1380 images
tagged with motorcycle, etc. How might you use this data to pre-train your model?
How would you handle the architectural differences between the pre-training task
(predicting tags) and the real task (predicting segmentations)?
Solution: Pre-train with ImageNet & Fine-tune for Segmentation
Step 1: Pre-train on ImageNet

● Leverage large-scale tagged data to pre-train a model for image classification.
● Learn general visual features (e.g., edges, textures, objects).

Step 2: Adapt Architecture

● Replace final fully connected layers (classification) with pixel-wise segmentation layers (fully
convolutional).
● Add upsampling or deconvolution layers to generate segmentation masks.
● Use encoder-decoder architectures like U-Net for better segmentation performance.

Step 3: Fine-tune the Model (a code sketch follows after this list)

● Freeze earlier layers (general feature extraction).
● Train new layers for segmentation using annotated images.
● Apply data augmentation (cropping, flipping, color jittering) to compensate for limited data.
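
One way the recipe above might look in PyTorch, assuming torchvision's ImageNet-pretrained ResNet-18 as the backbone; the segmentation head, the number of classes, and the choice of frozen layers are illustrative assumptions, not part of the slides.

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # e.g., sky, trees, vehicles, road, other (illustrative)

# Step 1: start from a classification model pre-trained on ImageNet tags.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc

# Step 2: adapt the architecture: replace the classification head with a
# pixel-wise 1x1 convolution plus upsampling to produce segmentation masks.
head = nn.Sequential(
    nn.Conv2d(512, NUM_CLASSES, kernel_size=1),
    nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
)
model = nn.Sequential(encoder, head)

# Step 3: fine-tune: freeze the general-purpose encoder, train only the new head.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(2, 3, 224, 224)   # dummy batch of images
logits = model(x)                  # shape: (2, NUM_CLASSES, 224, 224)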
