
Learning from Big Data

Lecture 19: Image recognition and CNNs

Dr. Lloyd T. Elliott, Fall 2022


Why image recognition is difficult

'Typographic attack': pen and paper fool AI into thinking apple is an iPod

• Recognizing objects in real scenes:

• Variation in lighting and viewpoint
• Definition of objects
• Requires huge amounts of knowledge (even for segmentation and viewpoint /
lighting)
State of the art
• ADOP: Approximate differentiable one-pixel point rendering (University of
Erlangen-Nuremberg)

• https://www.youtube.com/watch?v=WJRyu1JUtVw
Things that make it hard

• Segmentation: real scenes are cluttered with other objects:

• Hard to tell which pieces go together
• Parts of an object can be hidden or clipped (occlusion)
• Lighting: Intensities are as much determined by lighting as by the nature of the object
• Deformation: A wide variety of shapes can have the same name
• Affordances: For many objects, function is more important than shape for
definition
More things that make it hard to recognize objects

• Viewpoint: wide variety of viewpoints for the same object

• "Information hops between input dimensions" ("dimension hopping")

• We don't see this for many other types of structured data (medical data, for example)
Viewpoint invariance

• Each time we look at an object, we have a different viewpoint, unlike in other
machine learning tasks

• Humans are so good at handling viewpoint variation that it's hard to appreciate
how difficult it is
• One of the main difficulties in computer vision
• Typical approaches:
• Use redundant invariant features
• Bounding boxes
• Replicated features with pooling ("convolutional" neurons)
Invariant feature approach

• Extract a large, overlapping / redundant set of features invariant to
transformations (rotation, scaling, translation, shear, stretch)

• Example: centre / surround for the visual field

• Problem: features will overlap with objects that are not in the foreground ("parts
of different objects")

• Put a box around objects

• Normalize within the box
• Choosing the box is difficult (chicken / egg problem)
Brute force normalization

• When training the recognizer, use well-segmented, upright images to fit the
correct box

• At test time, try all possible boxes in a range of positions and scales (a sketch
of this search is given below)
• This was often used in computer vision up to ~2015
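A minimal sketch of this brute-force search. All names here are hypothetical; recognizer is assumed to be a function that normalizes a crop and returns a class score:

import numpy as np

def best_box(image, recognizer, sizes=(32, 48, 64), stride=8):
    # Score every candidate box over a range of positions and scales,
    # and keep the highest-scoring one.
    top_score, top_box = -np.inf, None
    height, width = image.shape[:2]
    for size in sizes:
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                crop = image[y:y + size, x:x + size]
                score = recognizer(crop)  # normalization happens inside the box
                if score > top_score:
                    top_score, top_box = score, (x, y, size)
    return top_box, top_score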
Convolutional neural nets

• LeNet, 1990s
• Use many different copies of the same feature detector at different
positions

• A feature detector useful in one place in the
image is likely useful in other areas too

• When we learn, we keep the red arrows all
having the same weights as each other

[Figure: red connections all have the same weight]
Convolutional neural nets

• Replication greatly reduces the number of free parameters to be learned

• In this example, 27 -> 9 weights (see the parameter-count check below)
• Make many maps, each one with replicates
of the same feature. Different maps learn to
detect different features.

• Each patch of the image can then be
represented by features of many different types

[Figure: red connections all have the same weight]
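To see the reduction in free parameters concretely, here is a small Keras check (a sketch; the 5x5 image size is illustrative, not from the lecture) comparing one 3x3 filter shared across all positions against a fully connected layer with an independent weight per connection:

from tensorflow import keras
from tensorflow.keras import layers

# One 3x3 filter replicated over a 5x5 image: 9 shared weights.
conv = keras.Sequential([layers.Conv2D(1, (3, 3), use_bias=False, input_shape=(5, 5, 1))])

# The same 9 outputs with independent weights: 25 x 9 = 225 weights.
dense = keras.Sequential([layers.Flatten(input_shape=(5, 5, 1)), layers.Dense(9, use_bias=False)])

print(conv.count_params())   # 9
print(dense.count_params())  # 225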
Backpropagation with weight constraints

• It's easy to modify the backpropagation algorithm to incorporate linear
constraints between the weights

• We compute gradients as usual, but we modify the gradients so that they satisfy
the constraints: to keep w1 = w2, we need Δw1 = Δw2

• This is done as follows:

• Compute ∂E/∂w1 and ∂E/∂w2
• Use the update ∂E/∂w1 + ∂E/∂w2 for both w1 and w2
• We can thus force backpropagation to use replicated features
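A toy numeric sketch of this constrained update (names hypothetical): the two gradients are summed and the sum is applied to both tied weights, so they remain equal after every step.

import numpy as np

def constrained_step(w, grads, lr=0.1):
    # w[0] and w[1] are tied: use dE/dw1 + dE/dw2 for both,
    # so w[0] == w[1] is preserved by the update.
    update = grads.copy()
    update[0] = update[1] = grads[0] + grads[1]
    return w - lr * update

w = np.array([0.5, 0.5, 1.0])        # w1 = w2 initially
grads = np.array([0.2, -0.1, 0.3])   # dE/dw1, dE/dw2, dE/dw3
print(constrained_step(w, grads))    # w1 and w2 stay equal: [0.49 0.49 0.97]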
What does replicating the features achieve?

• Equivariant activities: the neural activities in the next layer are not invariant to
translation, but they are equivariant

• The representation changes by as much as the image does (see the check below)

• Invariant knowledge: if a feature can be detected in one location, it can be
detected in other locations too
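A small numpy check of equivariance (a sketch, assuming 'valid' convolution and a shift that keeps the feature away from the borders): translating the input translates the feature map by the same amount.

import numpy as np

def conv_valid(img, kernel):
    # Plain 'valid' 2-D correlation with one replicated (shared) kernel.
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.zeros((8, 8))
img[2, 2] = 1.0                                  # a single bright pixel
shifted = np.roll(img, (1, 1), axis=(0, 1))      # translate the image

a = conv_valid(img, np.ones((3, 3)))
b = conv_valid(shifted, np.ones((3, 3)))
print(np.allclose(b, np.roll(a, (1, 1), axis=(0, 1))))  # True: equivariant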
Pooling the output of replicated feature detectors

• To get invariance in the activities, we must pool the outputs of the convolutional
layer

• Average or take the maximum over neighbouring replicated detectors to give a
single output to the next level (see the pooling sketch below)

• Reduces the number of inputs to the next layer (meaning we can learn more
features)

• Problem: after several levels of this pooling, we lose information about the
precise location of the object (that's fine for kilns, for example)
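A minimal sketch of pooling neighbouring detector outputs, using the same Keras layers as the LeNet5 code later in the lecture (the 4x4 feature map here is illustrative):

import numpy as np
from tensorflow.keras import layers

# A 4x4 feature map from one replicated detector (batch and channel dims added).
fmap = np.arange(16, dtype="float32").reshape(1, 4, 4, 1)

# Each 2x2 neighbourhood becomes one output: 4x4 -> 2x2.
print(layers.MaxPooling2D(pool_size=(2, 2))(fmap)[0, :, :, 0])      # [[5, 7], [13, 15]]
print(layers.AveragePooling2D(pool_size=(2, 2))(fmap)[0, :, :, 0])  # [[2.5, 4.5], [10.5, 12.5]]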
LeNet5
• Yann LeCun and collaborators developed the first good recognizer for
handwritten digits, using backpropagation in a feedforward net

• Many hidden layers, many maps of replicated units, pooling between layers.
Did not require segmentation

• Was deployed by the USPS; ~10% of zip code reading in the USA in the early 2000s
LeNet5 in TensorFlow

• medium.com/@mgazar
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
# Two convolutional layers of replicated feature detectors, each followed by pooling
model.add(layers.Conv2D(filters=6, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 1)))
model.add(layers.AveragePooling2D())
model.add(layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu'))
model.add(layers.AveragePooling2D())
model.add(layers.Flatten())
# Fully connected layers, ending in a 10-way softmax over digit classes
model.add(layers.Dense(units=120, activation='relu'))
model.add(layers.Dense(units=84, activation='relu'))
model.add(layers.Dense(units=10, activation='softmax'))
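A usage sketch for the model above: MNIST digits zero-padded from 28x28 to the 32x32 input shape. The optimizer and epoch count are illustrative assumptions, not from the lecture.

import numpy as np

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = np.pad(x_train, ((0, 0), (2, 2), (2, 2)))[..., np.newaxis] / 255.0
x_test = np.pad(x_test, ((0, 0), (2, 2), (2, 2)))[..., np.newaxis] / 255.0

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))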
Prior knowledge in machine learning

• LeNet5 prior knowledge:

• Connectivity
• Weight constraints
• Activation functions
• Less intrusive than hand-engineering features, but still pushes the network
towards a particular way of solving the problem

• Alternative: use prior knowledge to create more training data: augment
training data with simulated data (Hofman 1993)
More tricks

• Data augmentation
• Subsample & transform training images (AugMix, Hendrycks et al. 2019; see the
sketch below)
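A minimal augmentation sketch using Keras preprocessing layers (available in recent TensorFlow versions). This is a simple stand-in, not AugMix itself, which mixes chains of augmented images; x_train is assumed to be the padded MNIST array from the earlier sketch.

from tensorflow import keras
from tensorflow.keras import layers

augment = keras.Sequential([
    layers.RandomTranslation(0.1, 0.1),  # shift by up to 10% of height/width
    layers.RandomRotation(0.05),         # rotate by up to ~18 degrees
    layers.RandomZoom(0.1),              # zoom in/out by up to 10%
])

# Fresh random transforms are drawn for each batch during training.
augmented = augment(x_train[:32], training=True)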
Thank you
