Application of Deep Learning Part1

The document discusses image captioning and summarizes several papers on generating image descriptions with recurrent neural networks. It provides examples of captions generated for various images and discusses failure cases as well as applications to visual question answering and visual dialog.

Image Captioning

Figure from Karpathy et al, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015; figure copyright IEEE, 2015. Reproduced for educational purposes.

Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Show and Tell: A Neural Image Caption Generator, Vinyals et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

Fei-Fei Li, Ranjay Krishna, Danfei Xu
Lecture 10 - April 29, 2021


[Figure: captioning pipeline; panel labels: Convolutional Neural Network, Recurrent Neural Network]


test image

This image is CC0 public domain

[Figure: the test image is fed through the Convolutional Neural Network; the first input x0 to the Recurrent Neural Network is the <START> token]


test image

[Figure: x0 = <START> and the image feature v (through Wih) feed the hidden state h0, which produces the output y0]

before:
h = tanh(Wxh * x + Whh * h)

now:
h = tanh(Wxh * x + Whh * h + Wih * v)
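The "before" and "now" updates differ only by the extra Wih * v term. The following is a minimal sketch of both, assuming plain Python lists for vectors and matrices. The helper names (matvec, rnn_step), the toy sizes and the weight values are illustrative assumptions, not taken from the lecture.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_step(x, h_prev, Wxh, Whh, v=None, Wih=None):
    """One vanilla RNN step: h = tanh(Wxh * x + Whh * h [+ Wih * v]).

    The Wih * v term is added only when an image feature v is given.
    """
    pre = [a + b for a, b in zip(matvec(Wxh, x), matvec(Whh, h_prev))]
    if v is not None:
        pre = [p + c for p, c in zip(pre, matvec(Wih, v))]
    return [math.tanh(p) for p in pre]

# Toy sizes (assumptions): 2-d word input, 2-d hidden state, 3-d image feature.
Wxh = [[0.5, -0.1], [0.2, 0.3]]
Whh = [[0.1, 0.0], [0.0, 0.1]]
Wih = [[0.3, 0.0, -0.2], [0.1, 0.4, 0.0]]

x0 = [1.0, 0.0]        # stand-in encoding of the <START> token
h_init = [0.0, 0.0]    # initial hidden state
v = [1.0, 2.0, 3.0]    # stand-in for the CNN image feature

h_plain = rnn_step(x0, h_init, Wxh, Whh)          # "before"
h_image = rnn_step(x0, h_init, Wxh, Whh, v, Wih)  # "now"
```

With the same word input and initial state, h_plain and h_image differ, which shows that the image feature changes the hidden state the caption is generated from.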
test image

[Figure: caption sampling, one RNN step per slide]
input <START>: h0 produces y0; sample! -> straw
input straw: h1 produces y1; sample! -> hat
input hat: h2 produces y2; sample <END> token => finish.
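The sampling procedure on these slides (feed <START>, sample a word from the output, feed it back in, stop on <END>) can be sketched as a short loop. This is a hedged illustration: toy_step is a hypothetical stand-in for a trained RNN step, its probability table is made up so that the result matches the "straw hat" example, and the hidden state is reduced to a step counter.

```python
import random

START, END = "<START>", "<END>"
VOCAB = ["straw", "hat", END]

def toy_step(token, h):
    """Hypothetical stand-in for a trained RNN step.

    Returns (probabilities over VOCAB, new hidden state). A real model
    would compute these from the token embedding, h, and the image feature.
    """
    table = {
        START: [1.0, 0.0, 0.0],
        "straw": [0.0, 1.0, 0.0],
        "hat": [0.0, 0.0, 1.0],
    }
    return table[token], h + 1  # h is just a step counter in this toy

def sample_caption(step, h0, max_len=20, rng=None):
    """Sample words one at a time until <END> (or max_len) is reached."""
    rng = rng or random.Random(0)
    token, h, words = START, h0, []
    for _ in range(max_len):
        probs, h = step(token, h)
        # Draw the next word from the output distribution.
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == END:
            break
        words.append(token)  # the sampled word is fed back in on the next step
    return words

caption = sample_caption(toy_step, h0=0)
print(" ".join(caption))  # prints "straw hat"
```

The max_len cap is a design choice for safety: a model that never emits <END> would otherwise loop forever.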


Image Captioning: Example Results

Captions generated using neuraltalk2
All images are CC0 Public domain: cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle

A cat sitting on a suitcase on the floor
A cat is sitting on a tree branch
A dog is running in the grass with a frisbee
A white teddy bear sitting in the grass
Two people walking on the beach with surfboards
A tennis player in action on the court
Two giraffes standing in a grassy field
A man riding a dirt bike on a dirt track


Image Captioning: Failure Cases

Captions generated using neuraltalk2
All images are CC0 Public domain: fur coat, handstand, spider web, baseball

A bird is perched on a tree branch
A woman is holding a cat in her hand
A man in a baseball uniform throwing a ball
A woman standing on a beach holding a surfboard
A person holding a computer mouse on a desk


Visual Question Answering (VQA)

Agrawal et al, “VQA: Visual Question Answering”, ICCV 2015
Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016
Figure from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.


Visual Question Answering: RNNs with Attention

Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016
Figures from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.


Visual Dialog: Conversations about images

Das et al, “Visual Dialog”, CVPR 2017
Figures from Das et al, copyright IEEE 2017. Reproduced with permission.


Visual Language Navigation: Go to the living room

The agent encodes the language instructions and uses an RNN to generate a series of movements as the visual input changes after each move.

Wang et al, “Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019
Figures from Wang et al, copyright IEEE 2017. Reproduced with permission.


Visual Question Answering: Dataset Bias

All images are CC0 Public domain: dog

[Figure: Image, Question and Answer are fed to a Model that outputs Yes or No]
Question: What is the dog playing with?
Answer: Frisbee

Jabri et al. “Revisiting Visual Question Answering Baselines” ECCV 2016


Multilayer RNNs

[Figure: stacked RNN layers; axes: depth, time]
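The two axes of a multilayer RNN, depth and time, can be sketched as two nested loops: at each time step the input passes up through the stack of layers, and each layer keeps its own hidden state across time. This is a minimal sketch under assumptions: it uses vanilla tanh layers, all layers share one hidden size, and the function names and toy weights are illustrative only.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_step(x, h_prev, Wxh, Whh):
    """One vanilla RNN step: h = tanh(Wxh * x + Whh * h)."""
    return [math.tanh(a + b)
            for a, b in zip(matvec(Wxh, x), matvec(Whh, h_prev))]

def multilayer_rnn(inputs, layers, hidden_size):
    """Run a stack of RNN layers over a sequence.

    layers: list of (Wxh, Whh) pairs, bottom layer first.
    Returns the top-layer hidden state at every time step.
    """
    h = [[0.0] * hidden_size for _ in layers]  # one hidden state per layer
    outputs = []
    for x in inputs:                              # time axis
        layer_in = x
        for l, (Wxh, Whh) in enumerate(layers):   # depth axis
            h[l] = rnn_step(layer_in, h[l], Wxh, Whh)
            layer_in = h[l]  # this layer's state is the next layer's input
        outputs.append(layer_in)
    return outputs

# Toy weights (assumptions): identity input weights, zero recurrent weights.
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z2 = [[0.0, 0.0], [0.0, 0.0]]
layers = [(I2, Z2), (I2, Z2)]
outs = multilayer_rnn([[0.5, -0.5], [1.0, 0.0]], layers, hidden_size=2)
```

With these toy weights each output is simply tanh applied twice to the input, once per layer, which makes the depth axis easy to check by hand.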


Source: Fei-Fei Li, Ranjay Krishna, Danfei Xu
