Ch10 Deep Learning

Deep learning, which re-emerged around 2010, has become dominant due to advancements in computing power, larger datasets, and software like TensorFlow and PyTorch. Pioneers in the field, including Yann LeCun, Geoffrey Hinton, and Yoshua Bengio, were awarded the 2019 ACM Turing Award for their contributions to neural networks. The document also discusses the architecture and functioning of neural networks and convolutional neural networks (CNNs), highlighting their applications in image classification.

Deep Learning
Neural networks became popular in the 1980s. Lots of successes, hype, and great conferences: NeurIPS, Snowbird.
Then along came SVMs, Random Forests and Boosting in the 1990s, and Neural Networks took a back seat.
Re-emerged around 2010 as Deep Learning. By the 2020s, very dominant and successful.
Part of the success is due to vast improvements in computing power, larger training sets, and software: TensorFlow and PyTorch.

Much of the credit goes to three pioneers and their students: Yann LeCun, Geoffrey Hinton and Yoshua Bengio, who received the 2019 ACM Turing Award for their work in Neural Networks.

1 / 46
Single Layer Neural Network

  f(X) = β_0 + Σ_{k=1}^K β_k h_k(X)
       = β_0 + Σ_{k=1}^K β_k g(w_{k0} + Σ_{j=1}^p w_{kj} X_j).

[Figure: a single-layer network diagram with input layer X_1, ..., X_4, hidden layer A_1, ..., A_5, and output layer f(X) feeding the response Y.]

2 / 46
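
To make the formula concrete, here is a minimal base-R sketch of the forward pass for a single-layer network with K hidden units. The weight matrix W, the output weights beta, and the example input are hypothetical placeholders (randomly initialized, not fitted).

```r
# Forward pass: f(X) = beta_0 + sum_k beta_k * g(w_k0 + sum_j w_kj * X_j)
relu    <- function(z) pmax(z, 0)            # rectified linear activation
sigmoid <- function(z) 1 / (1 + exp(-z))     # sigmoid activation

forward <- function(x, W, beta, g = relu) {
  # x: numeric vector of p features; W: K x (p + 1) matrix, first column is the bias w_k0
  # beta: vector of length K + 1, first element is beta_0
  A <- g(W[, 1] + W[, -1, drop = FALSE] %*% x)   # hidden activations A_1, ..., A_K
  drop(beta[1] + sum(beta[-1] * A))              # output f(x)
}

set.seed(1)
p <- 4; K <- 5
W    <- matrix(rnorm(K * (p + 1)), K, p + 1)     # random (untrained) hidden-layer weights
beta <- rnorm(K + 1)                             # random output weights
forward(c(0.5, -1, 2, 0.3), W, beta)             # one prediction
```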
Details

[Figure: the sigmoid and ReLU activation functions g(z), plotted for z between −4 and 4.]

• A_k = h_k(X) = g(w_{k0} + Σ_{j=1}^p w_{kj} X_j) are called the activations in the hidden layer.
• g(z) is called the activation function. Popular choices are the sigmoid and the rectified linear (ReLU), shown in the figure.
• Activation functions in hidden layers are typically nonlinear, otherwise the model collapses to a linear model.
• So the activations are like derived features — nonlinear transformations of linear combinations of the features.
• The model is fit by minimizing Σ_{i=1}^n (y_i − f(x_i))² (e.g. for regression).

3 / 46
Example: MNIST Digits

Handwritten digits: 28 × 28 grayscale images.
60K train, 10K test images.
Features are the 784 pixel grayscale values ∈ (0, 255).
Labels are the digit class 0–9.

• Goal: build a classifier to predict the image class.
• We build a two-layer network with 256 units at the first layer, 128 units at the second layer, and 10 units at the output layer.
• Along with the intercepts (called biases) there are 235,146 parameters (referred to as weights).

4 / 46
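
A hedged sketch of this two-layer architecture using the keras R package mentioned in the Chapter 10 lab (this is not the authors' lab code). The layer sizes follow the slide; the activations, optimizer and loss settings are illustrative assumptions.

```r
library(keras)   # assumes keras and a TensorFlow backend are installed

# 784 inputs -> 256 -> 128 -> 10, as described on the slide
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

# (784*256 + 256) + (256*128 + 128) + (128*10 + 10) = 235,146 weights, matching the slide
summary(model)

model %>% compile(loss = "categorical_crossentropy",
                  optimizer = "rmsprop", metrics = c("accuracy"))
```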
[Figure: the two-hidden-layer MNIST network. Input layer X_1, ..., X_p; hidden layer L1 with units A_1^(1), ..., A_{K_1}^(1); hidden layer L2 with units A_1^(2), ..., A_{K_2}^(2); output layer f_0(X), ..., f_9(X) for Y_0, ..., Y_9. The weight matrices are W_1, W_2 and B.]

5 / 46
Details of Output Layer

• Let Z_m = β_{m0} + Σ_{ℓ=1}^{K_2} β_{mℓ} A_ℓ^(2), m = 0, 1, ..., 9, be 10 linear combinations of the activations at the second layer.
• The output activation function encodes the softmax function

    f_m(X) = Pr(Y = m | X) = e^{Z_m} / Σ_{ℓ=0}^{9} e^{Z_ℓ}.

• We fit the model by minimizing the negative multinomial log-likelihood (or cross-entropy):

    − Σ_{i=1}^n Σ_{m=0}^{9} y_{im} log(f_m(x_i)).

• y_{im} is 1 if the true class for observation i is m, else 0 — i.e. one-hot encoded.

6 / 46
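
A small base-R illustration of the softmax output and the cross-entropy above. Z here is a hypothetical matrix of the 10 linear combinations for a few observations, and y a vector of true class labels 0–9.

```r
softmax <- function(z) {
  z <- z - max(z)                  # subtract the max for numerical stability
  exp(z) / sum(exp(z))             # f_m = exp(Z_m) / sum_l exp(Z_l)
}

cross_entropy <- function(Z, y) {
  # Z: n x 10 matrix of linear combinations Z_m; y: integer labels in 0..9
  P <- t(apply(Z, 1, softmax))                     # n x 10 matrix of class probabilities
  -sum(log(P[cbind(seq_len(nrow(Z)), y + 1)]))     # -sum_i log f_{y_i}(x_i), i.e. the one-hot sum
}

set.seed(1)
Z <- matrix(rnorm(5 * 10), 5, 10)   # 5 toy observations
y <- c(3, 0, 9, 1, 3)
cross_entropy(Z, y)
```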
Results

Method                                     Test Error
Neural Network + Ridge Regularization          2.3%
Neural Network + Dropout Regularization        1.8%
Multinomial Logistic Regression                7.2%
Linear Discriminant Analysis                  12.7%

• Early success for neural networks in the 1990s.
• With so many parameters, regularization is essential.
• Some details of regularization and fitting will come later.
• Very overworked problem — best reported rates are < 0.5%!
• Human error rate is reported to be around 0.2%, or 20 of the 10K test images.

7 / 46
Convolutional Neural Network — CNN

• Major success story for classifying images.
• Shown are samples from the CIFAR100 database: 32 × 32 color natural images, with 100 classes.
• 50K training images, 10K test images.
• Each image is a three-dimensional array or feature map: a 32 × 32 × 3 array of 8-bit numbers. The last dimension represents the three color channels for red, green and blue.

8 / 46
How CNNs Work

• The CNN builds up an image in a hierarchical fashion.
• Edges and shapes are recognized and pieced together to form more complex shapes, eventually assembling the target image.
• This hierarchical construction is achieved using convolution and pooling layers.

9 / 46
Convolution Filter

  Input Image = [ a b c          Convolution Filter = [ α β
                  d e f                                 γ δ ]
                  g h i
                  j k l ]

  Convolved Image = [ aα + bβ + dγ + eδ   bα + cβ + eγ + fδ
                      dα + eβ + gγ + hδ   eα + fβ + hγ + iδ
                      gα + hβ + jγ + kδ   hα + iβ + kγ + lδ ]

• The filter is itself an image, and represents a small shape, edge, etc.
• We slide it around the input image, scoring for matches.
• The scoring is done via dot-products, illustrated above.
• If the subimage of the input image is similar to the filter, the score is high, otherwise low.
• The filters are learned during training.

10 / 46
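
A base-R sketch of the sliding dot-product above: convolving a 4 × 3 input with a 2 × 2 filter gives a 3 × 2 convolved image, matching the layout on the slide. The numeric values of the input and filter here are arbitrary stand-ins for (a, b, ..., l) and (α, β, γ, δ).

```r
convolve2d <- function(img, filt) {
  # "valid" convolution: slide filt over img and take dot-products with each subimage
  out <- matrix(0, nrow(img) - nrow(filt) + 1, ncol(img) - ncol(filt) + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(img[i:(i + nrow(filt) - 1), j:(j + ncol(filt) - 1)] * filt)
  out
}

img  <- matrix(1:12, nrow = 4, byrow = TRUE)       # plays the role of (a b c / d e f / g h i / j k l)
filt <- matrix(c(1, 0, 0, -1), 2, 2, byrow = TRUE) # plays the role of (alpha beta / gamma delta)
convolve2d(img, filt)                              # the 3 x 2 convolved image
```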
Convolution Example

• The idea of convolution with a filter is to find common patterns that occur in different parts of the image.
• The two filters shown here highlight vertical and horizontal stripes.
• The result of the convolution is a new feature map.
• Since images have three color channels, the filter does as well: one filter per channel, and the dot-products are summed.
• The weights in the filters are learned by the network.

11 / 46
Pooling

  Max pool:  [ 1 2 5 3
               3 0 1 2     →   [ 3 5
               2 1 3 4           2 4 ]
               1 1 2 0 ]

• Each non-overlapping 2 × 2 block is replaced by its maximum.
• This sharpens the feature identification.
• Allows for locational invariance.
• Reduces the dimension by a factor of 4 — i.e. a factor of 2 in each dimension.

12 / 46
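
A base-R sketch of 2 × 2 max pooling applied to the 4 × 4 matrix on the slide; it reproduces the 2 × 2 result shown.

```r
max_pool <- function(x, size = 2) {
  # replace each non-overlapping size x size block by its maximum
  out <- matrix(0, nrow(x) / size, ncol(x) / size)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out))) {
      rows <- ((i - 1) * size + 1):(i * size)
      cols <- ((j - 1) * size + 1):(j * size)
      out[i, j] <- max(x[rows, cols])
    }
  out
}

x <- matrix(c(1, 2, 5, 3,
              3, 0, 1, 2,
              2, 1, 3, 4,
              1, 1, 2, 0), nrow = 4, byrow = TRUE)
max_pool(x)   # gives the 2 x 2 matrix (3 5 / 2 4), as on the slide
```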
Architecture of a CNN

[Figure: schematic of a CNN. The 32 × 32 input passes through alternating convolve and pool layers of decreasing spatial size (32, 16, 8, 4) and increasing depth, is then flattened, and feeds fully connected layers leading to the output units.]

• Many convolve + pool layers.
• Filters are typically small, e.g. each channel 3 × 3.
• Each filter creates a new channel in the convolution layer.
• As pooling reduces size, the number of filters/channels is typically increased.
• The number of layers can be very large. E.g. resnet50, trained on the imagenet 1000-class image database, has 50 layers!

13 / 46
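
A hedged keras R sketch of a CNN in the style of the schematic above, for 32 × 32 × 3 CIFAR100-sized inputs. The filter counts (32, 64, 128) and the dense-layer size are illustrative choices, not read off the figure; only the alternating convolve + pool pattern and the 100-class output follow the slides.

```r
library(keras)   # assumes keras and a TensorFlow backend are installed

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), padding = "same",
                activation = "relu", input_shape = c(32, 32, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%        # 32 x 32 -> 16 x 16
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), padding = "same",
                activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%        # 16 x 16 -> 8 x 8
  layer_conv_2d(filters = 128, kernel_size = c(3, 3), padding = "same",
                activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%        # 8 x 8 -> 4 x 4
  layer_flatten() %>%
  layer_dense(units = 512, activation = "relu") %>%
  layer_dense(units = 100, activation = "softmax")     # 100 CIFAR100 classes
```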
Using Pretrained Networks to Classify Images

Image: flamingo           Image: Cooper's hawk           Image: Cooper's hawk
flamingo         0.83     kite (raptor)        0.60      fountain    0.35
spoonbill        0.17     great grey owl       0.09      nail        0.12
white stork      0.00     robin                0.06      hook        0.07

Image: Lhasa Apso         Image: cat                     Image: Cape weaver
Tibetan terrier  0.56     Old English sheepdog 0.82      jacamar     0.28
Lhasa            0.32     Shih-Tzu             0.04      macaw       0.12
cocker spaniel   0.03     Persian cat          0.04      robin       0.12

Here we use the 50-layer resnet50 network, trained on the 1000-class imagenet corpus, to classify some photographs. Each column shows the top three predicted classes and their probabilities for one image.

14 / 46
Document Classification: IMDB Movie Reviews

The IMDB corpus consists of user-supplied movie ratings for a large collection of movies. Each has been labeled for sentiment as positive or negative. Here is the beginning of a negative review:

  This has to be one of the worst films of the 1990s. When my friends & I were watching this film (being the target audience it was aimed at) we just sat & watched the first half an hour with our jaws touching the floor at how bad it really was. The rest of the time, everyone else in the theater just started talking to each other, leaving or generally crying into their popcorn . . .

We have labeled training and test sets, each consisting of 25,000 reviews, and each balanced with regard to sentiment.

We wish to build a classifier to predict the sentiment of a review.

15 / 46
Featurization: Bag-of-Words

Documents have different lengths, and consist of sequences of words. How do we create features X to characterize a document?
• From a dictionary, identify the 10K most frequently occurring words.
• Create a binary vector of length p = 10K for each document, and score a 1 in every position for which the corresponding word occurred.
• With n documents, we now have an n × p sparse feature matrix X.
• We compare a lasso logistic regression model to a two-hidden-layer neural network on the next slide. (No convolutions here!)
• Bag-of-words uses unigrams. We can instead use bigrams (occurrences of adjacent word pairs), and in general m-grams.

16 / 46
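
A base-R sketch of the bag-of-words featurization: a tiny made-up corpus, a small "dictionary" of the most frequent words, and the resulting binary document-by-word matrix. In practice one would use a sparse matrix class and a 10K-word dictionary.

```r
docs <- c("the film was a great great film",
          "worst film ever",
          "a great cast and a great story")

tokens <- strsplit(tolower(docs), "\\s+")                             # split each document into words
dict   <- names(sort(table(unlist(tokens)), decreasing = TRUE))[1:5]  # top-5 words as a toy dictionary

# n x p binary matrix: X[i, j] = 1 if dictionary word j occurs anywhere in document i
X <- t(sapply(tokens, function(w) as.integer(dict %in% w)))
colnames(X) <- dict
X
```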
Lasso versus Neural Network — IMDB Reviews

[Figure: two panels of classification accuracy on the IMDB data, with curves for the train, validation and test sets. Left: lasso logistic regression, accuracy versus −log(λ). Right: two-hidden-layer neural network, accuracy versus training epochs.]

• The simpler lasso logistic regression model works as well as the neural network in this case.
• glmnet was used to fit the lasso model, and is very effective because it can exploit sparsity in the X matrix.

17 / 46
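
A hedged sketch of fitting the lasso logistic regression with glmnet. The objects x_train, y_train, x_test and y_test are assumed to exist (a sparse binary feature matrix and 0/1 sentiment labels, as built above); the cross-validation call shown here chooses λ, rather than the validation-set curves in the figure.

```r
library(glmnet)

# x_train: sparse n x 10,000 binary matrix; y_train: 0/1 sentiment labels (assumed to exist)
cvfit <- cv.glmnet(x_train, y_train, family = "binomial", type.measure = "class")
plot(cvfit)                                        # misclassification error versus log(lambda)

pred <- predict(cvfit, newx = x_test, s = "lambda.min", type = "class")
mean(pred == y_test)                               # test-set accuracy
```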
Recurrent Neural Networks

Often data arise as sequences:
• Documents are sequences of words, and their relative positions have meaning.
• Time series such as weather data or financial indices.
• Recorded speech or music.
• Handwriting, such as doctor's notes.
RNNs build models that take into account this sequential nature of the data, and build a memory of the past.

• The feature for each observation is a sequence of vectors X = {X_1, X_2, ..., X_L}.
• The target Y is often of the usual kind — e.g. a single variable such as Sentiment, or a one-hot vector for multiclass classification.
• However, Y can also be a sequence, such as the same document in a different language.

18 / 46
Simple Recurrent Neural Network Architecture

[Figure: unrolled RNN diagram. The input sequence X_1, X_2, ..., X_L feeds hidden units A_1, A_2, ..., A_L, each A_ℓ also receiving A_{ℓ−1}, and each producing an output O_ℓ; the shared weight matrices are W (input to hidden), U (hidden to hidden) and B (hidden to output).]

• The hidden layer is a sequence of vectors A_ℓ, receiving as input X_ℓ as well as A_{ℓ−1}. A_ℓ produces an output O_ℓ.
• The same weights W, U and B are used at each step in the sequence — hence the term recurrent.
• The A_ℓ sequence represents an evolving model for the response that is updated as each element X_ℓ is processed.

19 / 46
RNN in Detail

Suppose X_ℓ = (X_{ℓ1}, X_{ℓ2}, ..., X_{ℓp}) has p components, and A_ℓ = (A_{ℓ1}, A_{ℓ2}, ..., A_{ℓK}) has K components. Then the computation at the kth component of the hidden unit A_ℓ is

  A_{ℓk} = g(w_{k0} + Σ_{j=1}^p w_{kj} X_{ℓj} + Σ_{s=1}^K u_{ks} A_{ℓ−1,s}),

  O_ℓ = β_0 + Σ_{k=1}^K β_k A_{ℓk}.

Often we are concerned only with the prediction O_L at the last unit. For squared-error loss, and n sequence/response pairs, we would minimize

  Σ_{i=1}^n (y_i − o_{iL})² = Σ_{i=1}^n (y_i − [β_0 + Σ_{k=1}^K β_k g(w_{k0} + Σ_{j=1}^p w_{kj} x_{iLj} + Σ_{s=1}^K u_{ks} a_{i,L−1,s})])².

20 / 46
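
A base-R sketch of the recurrent forward pass above for one sequence, returning the last output O_L. The weight matrices W (K × (p+1)), U (K × K) and the output weights beta are random placeholders, not fitted values; tanh is used as an example choice of g.

```r
rnn_forward <- function(X, W, U, beta, g = tanh) {
  # X: L x p matrix, one row per position in the sequence
  K <- nrow(W)
  A <- rep(0, K)                                   # A_0 = 0
  for (l in seq_len(nrow(X))) {
    # A_lk = g(w_k0 + sum_j w_kj X_lj + sum_s u_ks A_{l-1,s})
    A <- g(W[, 1] + W[, -1, drop = FALSE] %*% X[l, ] + U %*% A)
  }
  drop(beta[1] + sum(beta[-1] * A))                # O_L = beta_0 + sum_k beta_k A_Lk
}

set.seed(3)
L <- 5; p <- 3; K <- 12
X    <- matrix(rnorm(L * p), L, p)
W    <- matrix(rnorm(K * (p + 1), sd = 0.1), K, p + 1)
U    <- matrix(rnorm(K * K, sd = 0.1), K, K)
beta <- rnorm(K + 1, sd = 0.1)
rnn_forward(X, W, U, beta)
```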
RNN and IMDB Reviews

• The document feature is a sequence of words {W_ℓ}_{ℓ=1}^L. We typically truncate/pad the documents to the same number L of words (we use L = 500).
• Each word W_ℓ is represented as a one-hot encoded binary vector X_ℓ (dummy variable) of length 10K, with all zeros and a single one in the position for that word in the dictionary.
• This results in an extremely sparse feature representation, and would not work well.
• Instead we use a lower-dimensional pretrained word embedding matrix E (m × 10K, next slide).
• This reduces the binary feature vector of length 10K to a real feature vector of dimension m ≪ 10K (e.g. m in the low hundreds).

21 / 46
Word Embedding

[Figure: the start of a review, "this is one of the best films actually the best I have ever seen the film starts one fall day · · ·", shown first as one-hot columns (one row per dictionary word) and then, after the embedding, as lower-dimensional real-valued columns.]

Embeddings are pretrained on very large corpora of documents, using methods similar to principal components. word2vec and GloVe are popular.

22 / 46
RNN on IMDB Reviews

• After a lot of work, the results are a disappointing 76% accuracy.
• We then fit a more exotic RNN than the one displayed — an LSTM, with long and short term memory. Here A_ℓ receives input from A_{ℓ−1} (short term memory) as well as from a version that reaches further back in time (long term memory). Now we get 87% accuracy, slightly less than the 88% achieved by glmnet.
• These data have been used as a benchmark for new RNN architectures. The best reported result found at the time of writing (2020) was around 95%. We point to a leaderboard in Section 10.5.1.

23 / 46
Time Series Forecasting

[Figure: three daily NYSE time series plotted over time: Log(Trading Volume), Dow Jones Return and Log(Volatility).]

24 / 46
New York Stock Exchange Data
Shown in previous slide are three daily time series for the period
December 3, 1962 to December 31, 1986 (6,051 trading days):
• Log trading volume. This is the fraction of all
outstanding shares that are traded on that day, relative to
a 100-day moving average of past turnover, on the log scale.
• Dow Jones return. This is the difference between the log
of the Dow Jones Industrial Index on consecutive trading
days.
• Log volatility. This is based on the absolute values of
daily price movements.
Goal: predict Log trading volume tomorrow, given its
observed values up to today, as well as those of Dow Jones
return and Log volatility.
These data were assembled by LeBaron and Weigend (1998) IEEE
Transactions on Neural Networks, 9(1): 213–220.
25 / 46
Autocorrelation

[Figure: the autocorrelation function of Log(Trading Volume), for lags 0 to 35 trading days.]

• The autocorrelation at lag ℓ is the correlation of all pairs (v_t, v_{t−ℓ}) that are ℓ trading days apart.
• These sizable correlations give us confidence that past values will be helpful in predicting the future.
• This is a curious prediction problem: the response v_t is also a feature v_{t−ℓ}!

26 / 46
RNN Forecaster

We only have one series of data! How do we set up for an RNN?

We extract many short mini-series of input sequences X = {X_1, X_2, ..., X_L} with a predefined length L known as the lag:

  X_1 = (v_{t−L}, r_{t−L}, z_{t−L})ᵀ,  X_2 = (v_{t−L+1}, r_{t−L+1}, z_{t−L+1})ᵀ,  ...,  X_L = (v_{t−1}, r_{t−1}, z_{t−1})ᵀ,  and Y = v_t.

Since T = 6,051, with L = 5 we can create 6,046 such (X, Y) pairs.

We use the first 4,281 as training data, and the following 1,770 as test data. We fit an RNN with 12 hidden units per lag step (i.e. per A_ℓ).

27 / 46
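
A base-R sketch of extracting the lagged (X, Y) pairs described above from three series v (log volume), r (DJ return) and z (log volatility). The toy series here are random placeholders; with the real data and T = 6,051, L = 5 this would produce the 6,046 pairs on the slide.

```r
make_lagged <- function(v, r, z, L = 5) {
  T <- length(v)
  n <- T - L                                        # number of (X, Y) pairs
  X <- array(NA, dim = c(n, L, 3))                  # n sequences, L lag steps, 3 series per step
  for (i in seq_len(n)) {
    idx <- i:(i + L - 1)                            # days t-L, ..., t-1 for target day t = i + L
    X[i, , ] <- cbind(v[idx], r[idx], z[idx])
  }
  list(X = X, Y = v[(L + 1):T])                     # Y = v_t, tomorrow's log trading volume
}

set.seed(4)
toy <- make_lagged(v = rnorm(100), r = rnorm(100), z = rnorm(100), L = 5)
dim(toy$X)   # 95 x 5 x 3
```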
RNN Results for NYSE Data

[Figure: observed and predicted log(Trading Volume) over the test period, 1980–1986.]

The figure shows predictions and truth for the test period.

R² = 0.42 for the RNN.
R² = 0.18 for the straw man — use yesterday's value of Log trading volume to predict that of today.

28 / 46
Autoregression Forecaster

The RNN forecaster is similar in structure to a traditional autoregression procedure.

  y = (v_{L+1}, v_{L+2}, v_{L+3}, ..., v_T)ᵀ,

  M = [ 1  v_L      v_{L−1}  ···  v_1
        1  v_{L+1}  v_L      ···  v_2
        1  v_{L+2}  v_{L+1}  ···  v_3
        ⋮
        1  v_{T−1}  v_{T−2}  ···  v_{T−L} ]

Fit an OLS regression of y on M, giving

  v̂_t = β̂_0 + β̂_1 v_{t−1} + β̂_2 v_{t−2} + ··· + β̂_L v_{t−L}.

Known as an order-L autoregression model, or AR(L).

For the NYSE data we can include lagged versions of DJ return and log volatility in the matrix M, resulting in 3L + 1 columns.

29 / 46
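
A base-R sketch of the AR(L) fit: embed() builds the lagged design matrix M above (without the column of ones, which lm supplies as the intercept), and lm runs the OLS regression. The series v here is a simulated stand-in for log trading volume.

```r
set.seed(5)
v <- arima.sim(list(ar = 0.8), n = 300)    # a toy autocorrelated series standing in for log volume
L <- 5

lagged <- embed(v, L + 1)                  # row t contains (v_t, v_{t-1}, ..., v_{t-L})
y <- lagged[, 1]                           # response v_t
M <- lagged[, -1]                          # lagged predictors v_{t-1}, ..., v_{t-L}

ar5 <- lm(y ~ M)                           # v_t ~ beta_0 + beta_1 v_{t-1} + ... + beta_L v_{t-L}
coef(ar5)
```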
Autoregression Results for NYSE Data

R² = 0.41 for the AR(5) model (16 parameters)
R² = 0.42 for the RNN model (205 parameters)
R² = 0.42 for the AR(5) model fit by a neural network
R² = 0.46 for all models if we include the day of the week of the day being predicted

30 / 46
Summary of RNNs

• We have presented the simplest of RNNs. Many more complex variations exist.
• One variation treats the sequence as a one-dimensional image, and uses CNNs for fitting. For example, a sequence of words using an embedding representation can be viewed as an image, and the CNN convolves by sliding a convolutional filter along the sequence.
• Can have additional hidden layers, where each hidden layer is a sequence, and treats the previous hidden layer as an input sequence.
• Can have the output also be a sequence, with input and output sharing the hidden units. So-called seq2seq learning is used for language translation.

31 / 46
When to Use Deep Learning

• CNNs have had enormous successes in image classification and modeling, and are starting to be used in medical diagnosis. Examples include digital mammography, ophthalmology, MRI scans, and digital X-rays.
• RNNs have had big wins in speech modeling, language translation, and forecasting.

Should we always use deep learning models?
• Often the big successes occur when the signal-to-noise ratio is high — e.g. image recognition and language translation. Datasets are large, and overfitting is not a big problem.
• For noisier data, simpler models can often work better.
• On the NYSE data, the AR(5) model is much simpler than an RNN, and performed as well.
• On the IMDB review data, the linear model fit by glmnet did as well as the neural network, and better than the RNN.
• We endorse the Occam's razor principle — we prefer simpler models if they work as well. More interpretable!

32 / 46
Fitting Neural Networks

[Figure: single hidden layer network with inputs X_1, ..., X_4, hidden units A_1, ..., A_5, and output f(X) feeding Y.]

  minimize over {w_k}_{k=1}^K, β:   (1/2) Σ_{i=1}^n (y_i − f(x_i))²,

where

  f(x_i) = β_0 + Σ_{k=1}^K β_k g(w_{k0} + Σ_{j=1}^p w_{kj} x_{ij}).

This problem is difficult because the objective is non-convex.

Despite this, effective algorithms have evolved that can optimize complex neural network problems efficiently.

33 / 46
Non Convex Functions and Gradient Descent

Let R(θ) = (1/2) Σ_{i=1}^n (y_i − f_θ(x_i))², with θ = ({w_k}_{k=1}^K, β).

[Figure: a non-convex objective R(θ) plotted against a one-dimensional θ, with the gradient-descent iterates R(θ⁰), R(θ¹), R(θ²), ..., R(θ⁷) descending toward a minimum.]

1. Start with a guess θ⁰ for all the parameters in θ, and set t = 0.
2. Iterate until the objective R(θ) fails to decrease:
   (a) Find a vector δ that reflects a small change in θ, such that θ^{t+1} = θ^t + δ reduces the objective; i.e. R(θ^{t+1}) < R(θ^t).
   (b) Set t ← t + 1.

34 / 46
Gradient Descent Continued

• In this simple example we reached the global minimum.
• If we had started a little to the left of θ⁰ we would have gone in the other direction, and ended up in a local minimum.
• Although θ is multi-dimensional, we have depicted the process as one-dimensional. It is much harder to identify whether one is in a local minimum in high dimensions.

How do we find a direction δ that points downhill? We compute the gradient vector

  ∇R(θ^t) = ∂R(θ)/∂θ evaluated at θ = θ^t,

i.e. the vector of partial derivatives at the current guess θ^t. The gradient points uphill, so our update is δ = −ρ∇R(θ^t), or

  θ^{t+1} ← θ^t − ρ∇R(θ^t),

where ρ is the learning rate (typically small, e.g. ρ = 0.001).

35 / 46
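
A base-R sketch of the update θ^{t+1} ← θ^t − ρ∇R(θ^t) on a one-dimensional non-convex function, in the spirit of the figure. The objective here is made up purely for illustration.

```r
R     <- function(theta) theta^4 - theta^2 + 0.2 * theta      # a toy non-convex objective
gradR <- function(theta) 4 * theta^3 - 2 * theta + 0.2        # its derivative (the gradient)

theta <- 1.0          # starting guess theta^0
rho   <- 0.01         # learning rate
for (t in 1:200) {
  theta <- theta - rho * gradR(theta)    # step downhill along the negative gradient
}
c(theta = theta, R = R(theta))           # a (possibly local) minimum
```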
Gradients and Backpropagation

R(θ) = Σ_{i=1}^n R_i(θ) is a sum, so the gradient is a sum of gradients.

  R_i(θ) = (1/2)(y_i − f_θ(x_i))² = (1/2)(y_i − β_0 − Σ_{k=1}^K β_k g(w_{k0} + Σ_{j=1}^p w_{kj} x_{ij}))²

For ease of notation, let z_{ik} = w_{k0} + Σ_{j=1}^p w_{kj} x_{ij}.

Backpropagation uses the chain rule for differentiation:

  ∂R_i(θ)/∂β_k = ∂R_i(θ)/∂f_θ(x_i) · ∂f_θ(x_i)/∂β_k
               = −(y_i − f_θ(x_i)) · g(z_{ik}).

  ∂R_i(θ)/∂w_{kj} = ∂R_i(θ)/∂f_θ(x_i) · ∂f_θ(x_i)/∂g(z_{ik}) · ∂g(z_{ik})/∂z_{ik} · ∂z_{ik}/∂w_{kj}
                  = −(y_i − f_θ(x_i)) · β_k · g′(z_{ik}) · x_{ij}.

36 / 46
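
A base-R sketch of the two gradient formulas above for a single observation, using the ReLU activation (so g′(z) is 1 for z > 0 and 0 otherwise). The weights W and beta are random placeholders, and the convention for the bias terms (first column of W, first element of beta) is an assumption of this sketch.

```r
relu  <- function(z) pmax(z, 0)
drelu <- function(z) as.numeric(z > 0)      # g'(z) for the ReLU

grad_one_obs <- function(x, y, W, beta) {
  z <- drop(W[, 1] + W[, -1, drop = FALSE] %*% x)   # z_ik = w_k0 + sum_j w_kj x_ij
  A <- relu(z)                                      # hidden activations g(z_ik)
  f <- beta[1] + sum(beta[-1] * A)                  # f_theta(x_i)
  resid <- y - f
  dbeta <- -resid * c(1, A)                            # dR_i/dbeta_0 = -(y-f), dR_i/dbeta_k = -(y-f) g(z_ik)
  dW    <- (-resid * beta[-1] * drelu(z)) %o% c(1, x)  # dR_i/dw_kj = -(y-f) beta_k g'(z_ik) x_ij
  list(dbeta = dbeta, dW = dW)
}

set.seed(6)
p <- 3; K <- 4
W    <- matrix(rnorm(K * (p + 1)), K, p + 1)
beta <- rnorm(K + 1)
grad_one_obs(x = c(1, -2, 0.5), y = 1.5, W, beta)
```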
Tricks of the Trade
• Slow learning. Gradient descent is slow, and a small
learning rate ρ slows it even further. With early stopping,
this is a form of regularization.
• Stochastic gradient descent. Rather than compute the
gradient using all the data, use a small minibatch drawn at
random at each step. E.g. for MNIST data, with n = 60K,
we use minibatches of 128 observations.
• An epoch is a count of iterations and amounts to the
number of minibatch updates such that n samples in total
have been processed; i.e. 60K/128 ≈ 469 for MNIST.
• Regularization. Ridge and lasso regularization can be used
to shrink the weights at each layer. Two other popular
forms of regularization are dropout and augmentation,
discussed next.

37 / 46
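
A base-R sketch of the minibatch bookkeeping for one epoch: shuffle the n observation indices and split them into batches of 128. For n = 60,000 this gives the roughly 469 minibatch updates per epoch mentioned above; the actual gradient update on each batch is left as a placeholder.

```r
n <- 60000; batch_size <- 128
ceiling(n / batch_size)                     # about 469 minibatch updates per epoch, as on the slide

shuffled <- sample(n)                                                    # new random order each epoch
batches  <- split(shuffled, ceiling(seq_along(shuffled) / batch_size))   # list of minibatch index sets
length(batches)   # also 469; each SGD step would use the gradient computed on one of these index sets
```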
Dropout Learning

• At each SGD update, randomly remove units with probability φ, and scale up the weights of those retained by 1/(1 − φ) to compensate.
• In simple scenarios like linear regression, a version of this process can be shown to be equivalent to ridge regularization.
• As in ridge, the other units stand in for those temporarily removed, and their weights are drawn closer together.
• Similar to randomly omitting variables when growing trees in random forests (Chapter 8).

38 / 46
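
A base-R sketch of a dropout mask as described in the first bullet: each hidden activation is zeroed with probability φ and the survivors are scaled by 1/(1 − φ). This would be applied afresh at each SGD update, during training only; the activations here are made-up values.

```r
dropout <- function(A, phi = 0.4) {
  # A: vector (or matrix) of hidden-layer activations
  keep <- rbinom(length(A), size = 1, prob = 1 - phi)   # 0 = dropped, 1 = retained
  A * keep / (1 - phi)                                  # scale retained units by 1/(1 - phi)
}

set.seed(7)
A <- runif(10)          # ten hypothetical activations
dropout(A)              # roughly 40% of them set to zero, the rest scaled up
```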
Ridge and Data Augmentation

[Figure: two scatter plots in (X1, X2). Left: the original observations. Right: the augmented data, with a small cloud of noisy copies around each observation.]

• Make many copies of each (x_i, y_i) and add a small amount of Gaussian noise to the x_i — a little cloud around each observation — but leave the copies of y_i alone!
• This makes the fit robust to small perturbations in x_i, and is equivalent to ridge regularization in an OLS setting.

39 / 46
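
A base-R sketch of the augmentation described above: each row of a feature matrix x is copied several times with a small amount of Gaussian noise added, while the responses y are simply repeated. The number of copies and noise level are arbitrary illustration values.

```r
augment <- function(x, y, copies = 10, sd = 0.1) {
  # x: n x p feature matrix; y: length-n response vector
  x_big <- x[rep(seq_len(nrow(x)), each = copies), , drop = FALSE]
  x_big <- x_big + matrix(rnorm(length(x_big), sd = sd), nrow(x_big), ncol(x_big))  # the "cloud"
  y_big <- rep(y, each = copies)                                                    # labels unchanged
  list(x = x_big, y = y_big)
}

set.seed(8)
x <- matrix(rnorm(20), 10, 2)          # 10 observations, 2 features
y <- rbinom(10, 1, 0.5)
aug <- augment(x, y)
dim(aug$x)                             # 100 x 2: a little cloud of 10 copies around each point
```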
Data Augmentation on the Fly

• Data augmentation is especially effective with SGD, here demonstrated for a CNN and image classification.
• Natural transformations are made of each training image when it is sampled by SGD, thus ultimately making a cloud of images around each original training image.
• The label is left unchanged — in each case still tiger.
• Improves the performance of the CNN, and is similar to ridge.

40 / 46
Double Descent

• With neural networks, it seems better to have too many hidden units than too few.
• Likewise, more hidden layers seem better than few.
• Running stochastic gradient descent till zero training error often gives good out-of-sample error.
• Increasing the number of units or layers and again training till zero error sometimes gives even better out-of-sample error.

What happened to overfitting and the usual bias-variance trade-off?

Belkin, Hsu, Ma and Mandal (arXiv 2018), "Reconciling Modern Machine Learning and the Bias-Variance Trade-off."

41 / 46
Simulation

• y = sin(x) + ε, with x ∼ U[−5, 5] and ε Gaussian with S.D. = 0.3.
• Training set n = 20, test set very large (10K).
• We fit a natural spline to the data (Section 7.4) with d degrees of freedom — i.e. a linear regression onto d basis functions: ŷ_i = β̂_1 N_1(x_i) + β̂_2 N_2(x_i) + ··· + β̂_d N_d(x_i).
• When d = 20 we fit the training data exactly, and get all residuals equal to zero.
• When d > 20, we still fit the data exactly, but the solution is not unique. Among the zero-residual solutions, we pick the one with minimum norm — i.e. the zero-residual solution with the smallest Σ_{j=1}^d β̂_j².

42 / 46
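
A sketch of this simulation, assuming the splines and MASS packages are available: for d > 20 basis functions the least-squares problem has many zero-residual solutions, and the Moore–Penrose pseudoinverse (MASS::ginv) selects the minimum-norm one. The choice d = 42 matches one of the panels on the next slides; details such as the exact basis construction differ from the authors' code.

```r
library(splines)   # ns() natural-spline basis
library(MASS)      # ginv() Moore-Penrose pseudoinverse

set.seed(9)
n <- 20
x <- runif(n, -5, 5)
y <- sin(x) + rnorm(n, sd = 0.3)

d <- 42                                   # more basis functions than observations
B <- ns(x, df = d)                        # n x d natural-spline basis matrix

beta_hat <- ginv(B) %*% y                 # minimum-norm zero-residual solution
max(abs(B %*% beta_hat - y))              # residuals are (numerically) zero
sum(beta_hat^2)                           # the norm that shrinks as d grows beyond 20
```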
The Double-Descent Error Curve

[Figure: training and test error versus degrees of freedom d, from 2 to 50. The training error drops to zero at d = 20, while the test error follows the double-descent shape, peaking at the interpolation threshold d = 20 and decreasing again for larger d.]

• When d ≤ 20, the model is OLS, and we see the usual bias-variance trade-off.
• When d > 20, we revert to the minimum-norm solution. As d increases above 20, Σ_{j=1}^d β̂_j² decreases, since it is easier to achieve zero error, and hence we get less wiggly solutions.

43 / 46
Less Wiggly Solutions

[Figure: the fitted functions with 8, 20, 42 and 80 degrees of freedom, plotted over x ∈ [−5, 5]. The d = 20 zero-residual fit is very wiggly; the minimum-norm fits with d = 42 and d = 80 are noticeably smoother.]

To achieve a zero-residual solution with d = 20 is a real stretch! Easier for larger d.

44 / 46
Some Facts

• In a wide linear model (p ≫ n) fit by least squares, SGD with a small step size leads to a minimum-norm zero-residual solution.
• Stochastic gradient flow — i.e. the entire path of SGD solutions — is somewhat similar to the ridge path.
• By analogy, deep and wide neural networks fit by SGD down to zero training error often give good solutions that generalize well.
• In particular, cases with a high signal-to-noise ratio — e.g. image recognition — are less prone to overfitting; the zero-error solution is mostly signal!

45 / 46
Software

• Wonderful software is available for neural networks and deep learning: TensorFlow from Google and PyTorch from Facebook. Both are Python packages.
• In the Chapter 10 lab we demonstrate the tensorflow and keras packages in R, which interface to Python. See the textbook and online resources for Rmarkdown and Jupyter notebooks for these and all labs for the second edition of the ISLR book.
• The torch package in R is available as well, and implements the PyTorch dialect. The Chapter 10 lab will be available in this dialect too; watch the resources page at www.statlearning.com.

46 / 46
