Multimodal Machine Learning
Lecture 1.2:
Multimodal Research Tasks
Louis-Philippe Morency
Guest lecture by Paul Liang
Lecture Objectives
Multimodal Research Tasks
[Timeline figure] Milestones: audio-visual speech recognition; content-based video retrieval (TrecVid); affect and emotion recognition (the birth of "affective computing" research); video event recognition; image captioning (revisited) and the birth of "Language & Vision"; multimodal sentiment analysis.
Multimodal Research Tasks
… and many many more!
Real world tasks tackled by MMML
A. Affect recognition
§ Emotion
§ Personalities
§ Sentiment
B. Media description
§ Image and video captioning
C. Multimodal QA
§ Image and video QA
§ Visual reasoning
D. Multimodal Navigation
§ Language-guided navigation
§ Autonomous driving
E. Multimodal Dialog
§ Grounded dialog
F. Event recognition
§ Action recognition
§ Segmentation
G. Multimedia information retrieval
§ Content-based / cross-media
Affective Computing
Common Topics in Affective Computing
Common topics in affective computing
§ Affective states (e.g., anger, frustration, disgust, anxiety, fear, contempt, happiness, shame, sadness, guilt, positivity, wonder, activation, relaxation, pride, pain, desire, envy)
§ Cognitive states
§ Personality
§ Pathology
§ Social processes
Common topics in affective computing
§ Affective states
§ Cognitive states (e.g., engagement, interest, attention, concentration, effort, surprise, confusion, agreement, doubt, knowledge)
§ Personality
§ Pathology
§ Social processes
Common topics in affective computing
§ Affective states
§ Cognitive states
§ Personality (e.g., outgoing, pessimistic, assertive, anxious, energetic, moody, sympathetic, curious, respectful, artistic, trusting, creative, organized, sincere, productive, modest, responsible, fair)
§ Pathology
§ Social processes
Common topics in affective computing
§ Affective states
§ Cognitive states
§ Personality
§ Pathology (e.g., depression, anxiety, trauma, addiction, schizophrenia, antagonism, detachment, disinhibition, negative affectivity, psychoticism)
§ Social processes
Common topics in affective computing
11-776 Multimodal Affective Computing
Audio-visual Emotion Challenge 2011/2012
§ Arousal, expectancy, power, valence
§ Has transcripts
[AVEC 2011 – The First International Audio/Visual Emotion Challenge, B. Schuller et al., 2011]
Audio-visual Emotion Challenge 2013/2014
AVEC 2013/2014
[AVEC 2013 – The Continuous Audio/Visual Emotion and Depression Recognition Challenge, Valstar et al., 2013]
Audio-visual Emotion Challenge 2015/2016 (AVEC 2015)
§ RECOLA dataset
§ Audio-visual emotion recognition
§ Labeled for dimensional emotion per frame (arousal, valence)
§ Includes physiological data
§ 27 participants
§ French; audio, video, ECG and EDA
§ Collaboration task in video conference
§ Broader range of emotive expressions
[Introducing the RECOLA Multimodal Corpus of Remote Collaborative and Affective Interactions, F. Ringeval et al., 2013]
Multimodal Sentiment Analysis
[Figure: core-challenges illustration – "Big dog on the beach" example with prediction over time steps t1, t2, t3]
Project Example: Select-Additive Learning
Research task: Multimodal sentiment analysis
Datasets: MOSI, YouTube, MOUD
Main idea: reducing the effect of confounding factors when the dataset size is limited
Confounding factor!
Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency and Eric P. Xing, Select-additive Learning: Improving Generalization In Multimodal Sentiment Analysis, ICME 2017, https://arxiv.org/abs/1609.05244
Project Example: Select-Additive Learning
Hypothesis: the representation is a mixture of the person-independent factor g(X) and the person-dependent factor h(Z) (a minimal sketch follows below).
Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency and Eric P. Xing, Select-additive Learning: Improving Generalization In Multimodal Sentiment Analysis, ICME 2017, https://arxiv.org/abs/1609.05244
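To make this hypothesis concrete, below is a minimal PyTorch sketch of the additive decomposition idea, with hypothetical names (`content_encoder` for g(X), `identity_embedding` for h(Z)); it illustrates training on g(X) + h(Z) and predicting from the person-independent part alone, not the exact selection/addition procedure of the ICME 2017 paper.

```python
import torch
import torch.nn as nn

class SelectAdditiveSketch(nn.Module):
    """Sketch of the hypothesis: representation = g(X) + h(Z)."""
    def __init__(self, input_dim, num_speakers, hidden_dim=64):
        super().__init__()
        # g(X): person-independent content encoder (hypothetical)
        self.content_encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # h(Z): person-dependent (confounding) factor, indexed by speaker identity
        self.identity_embedding = nn.Embedding(num_speakers, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, 1)  # sentiment score

    def forward(self, x, speaker_id=None):
        g = self.content_encoder(x)
        if self.training and speaker_id is not None:
            # Training: model the representation as g(X) + h(Z); noise on the
            # identity part discourages the classifier from relying on speaker cues.
            h = self.identity_embedding(speaker_id)
            rep = g + h + 0.1 * torch.randn_like(h)
        else:
            # Test time: keep only the person-independent factor g(X).
            rep = g
        return self.classifier(rep)
```

During training one would call model(features, speaker_id); after model.eval(), predictions depend only on g(X).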
Project Example: Word-Level Gated Fusion
Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, Louis-Philippe Morency, Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning, ICMI 2017, https://arxiv.org/abs/1802.00924
Project Example: Word-Level Gated Fusion
Solution:
- Word-level alignment
- Temporal attention over words
- Gated attention over modalities
Hypothesis: attention weights represent the contribution of each modality at each time step (a minimal gated-fusion sketch follows below)
Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, Louis-Philippe Morency, Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning, ICMI 2017, https://arxiv.org/abs/1802.00924
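Below is a minimal sketch of word-level gated fusion, assuming word-aligned language, visual, and acoustic features and hypothetical layer names (not the exact ICMI 2017 architecture): a learned gate scores each modality at every word, and the fused vector is the gate-weighted sum, so the gates can be inspected as per-modality contributions.

```python
import torch
import torch.nn as nn

class GatedModalityFusion(nn.Module):
    def __init__(self, dim_l, dim_v, dim_a, dim_out):
        super().__init__()
        self.proj_l = nn.Linear(dim_l, dim_out)
        self.proj_v = nn.Linear(dim_v, dim_out)
        self.proj_a = nn.Linear(dim_a, dim_out)
        # One gate score per modality, conditioned on all modalities at time t.
        self.gate = nn.Linear(dim_l + dim_v + dim_a, 3)

    def forward(self, lang, vis, aco):
        # lang, vis, aco: (batch, time, dim_*) features, already word-aligned.
        gates = torch.softmax(self.gate(torch.cat([lang, vis, aco], dim=-1)), dim=-1)
        stacked = torch.stack(
            [self.proj_l(lang), self.proj_v(vis), self.proj_a(aco)], dim=-2
        )  # (batch, time, 3, dim_out)
        fused = (gates.unsqueeze(-1) * stacked).sum(dim=-2)  # (batch, time, dim_out)
        return fused, gates  # gates read as per-modality contributions per word
```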
Media Description
Media description
Based on audio descriptions for the blind (Descriptive Video Service – DVS)
https://visualgenome.org/
What are the Core Challenges Most Involved in Media Description?
[Figure: core-challenges illustration – "Big dog on the beach" example with prediction over time steps t1, t2, t3]
Multimodal QA
Visual Question Answering – VQA Challenge 2016 and 2017 (C1)
§ Real images
§ 200k MS COCO images
§ 600k questions
§ 6M answers
§ 1.8M plausible answers
§ Abstract images
§ 50k scenes
§ 150k questions
§ 1.5M answers
§ 450k plausible answers
[Figure: core-challenges illustration – "Big dog on the beach" example with prediction over time steps t1, t2, t3]
Project Example: Adversarial Attacks on VQA models
Vasu Sharma, Ankita Kalra, Vaibhav, Simral Chaudhary, Labhesh Patel, Louis-Philippe Morency, Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models. NeurIPS ViGIL workshop 2018. https://nips2018vigil.github.io/static/papers/accepted/33.pdf
Project Example: Adversarial Attacks on VQA models
[Figure: adversarially perturbed image + VQA model; predicted answer changes from "Roses" to "Sunflower"]
How can we design a targeted attack on images in VQA models, which will help in assessing the robustness of existing models?
Vasu Sharma, Ankita Kalra, Vaibhav, Simral Chaudhary, Labhesh Patel, Louis-Philippe Morency, Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models. NeurIPS ViGIL workshop 2018. https://nips2018vigil.github.io/static/papers/accepted/33.pdf
Project Example: Adversarial Attacks on VQA models
Hypothesis: the question helps to localize important visual regions for targeted adversarial attacks (a minimal attention-guided attack sketch follows below)
Vasu Sharma, Ankita Kalra, Vaibhav, Simral Chaudhary, Labhesh Patel, Louis-Philippe Morency, Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models. NeurIPS ViGIL workshop 2018. https://nips2018vigil.github.io/static/papers/accepted/33.pdf
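Below is a minimal sketch of an attention-guided targeted attack, assuming a hypothetical `vqa_model(image, question)` that returns answer logits together with a question-conditioned attention map (the exact Attend-and-Attack procedure may differ): the attention map decides where to perturb, and a targeted FGSM-style step decides how.

```python
import torch
import torch.nn.functional as F

def attention_guided_attack(vqa_model, image, question, target_answer,
                            epsilon=0.03, steps=10):
    """Assumes vqa_model(image, question) -> (answer_logits, attn_map), where
    attn_map has shape (batch, 1, H, W) over image locations."""
    adv = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        logits, attn = vqa_model(adv, question)
        # Targeted attack: push the prediction toward `target_answer`.
        loss = F.cross_entropy(logits, target_answer)
        grad, = torch.autograd.grad(loss, adv)
        # Concentrate the perturbation on regions the model attends to.
        mask = attn / (attn.amax(dim=(-2, -1), keepdim=True) + 1e-8)
        with torch.no_grad():
            adv -= (epsilon / steps) * mask * grad.sign()
            adv.clamp_(0, 1)
    return adv.detach()
```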
Multimodal Navigation
Embedded Assistive Agents
Language, Vision and Actions
Many Technical Challenges
Instruction:
Find the window. Look left at the cribs. Search for the tricolor crib. The target is below that crib.
Instruction generation
Instruction following
Linking Action-Language-Vision
Refer360 Dataset
Step 1: place the door leading outside to center.
Step 2: notice the silver and black coffee pot closest to you on the bar. see the black trash bin on the floor in front of the coffee pot.
Step 3: waldo is on the face of the trash bin about 1 foot off the floor and also slightly on the brown wood.
Language meets Games
Heinrich Kuttler, Nantas Nardelli, Alexander H. Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette and Tim Rocktaschel, The NetHack Learning Environment. https://arxiv.org/abs/2006.13760
Language meets Games
Shrimai Prabhumoye, Margaret Li, Jack Urbanek, Emily Dinan, Douwe Kiela, Jason Weston, Arthur Szlam. I love your chain mail! Making knights smile in a fantasy game world: Open-domain goal-oriented dialogue agents. https://arxiv.org/abs/2002.02878
What are the Core Challenges Most Involved in Multimodal Navigation?
[Figure: core-challenges illustration – "Big dog on the beach" example with prediction over time steps t1, t2, t3]
Project Example: Instruction Following
Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, Ruslan Salakhutdinov, Gated-Attention Architectures for Task-Oriented Language Grounding. AAAI 2018. https://arxiv.org/abs/1706.07230
Project Example: Instruction Following
Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, Ruslan Salakhutdinov, Gated-Attention Architectures for Task-Oriented Language Grounding. AAAI 2018. https://arxiv.org/abs/1706.07230
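Below is a minimal sketch of the gated-attention fusion idea described in the paper, with hypothetical module names: the instruction embedding is mapped to per-channel sigmoid gates that multiplicatively modulate the convolutional image features, emphasizing channels relevant to the instruction.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    def __init__(self, instr_dim, num_channels):
        super().__init__()
        self.to_gates = nn.Linear(instr_dim, num_channels)

    def forward(self, image_feats, instr_embedding):
        # image_feats: (batch, C, H, W) from a conv encoder
        # instr_embedding: (batch, instr_dim) from a recurrent instruction encoder
        gates = torch.sigmoid(self.to_gates(instr_embedding))   # (batch, C)
        gates = gates.unsqueeze(-1).unsqueeze(-1)                # (batch, C, 1, 1)
        return image_feats * gates  # instruction-relevant channels are kept
```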
Project Example: Multiagent Trajectory Forecasting
Seong Hyeon Park, Gyubok Lee, Manoj Bhat, Jimin Seo, Minseok Kang, Jonathan Francis, Ashwin R. Jadhav, Paul Pu Liang, Louis-Philippe Morency, Diverse and Admissible Trajectory Forecasting through Multimodal Context Understanding. ECCV 2020. https://arxiv.org/abs/2003.03212
Project Example: Multiagent Trajectory Forecasting
Hypothesis: both agent-agent interactions and agent-scene interactions are important! (a minimal context-encoder sketch follows below)
Seong Hyeon Park, Gyubok Lee, Manoj Bhat, Jimin Seo, Minseok Kang, Jonathan Francis, Ashwin R. Jadhav, Paul Pu Liang, Louis-Philippe Morency, Diverse and Admissible Trajectory Forecasting through Multimodal Context Understanding. ECCV 2020. https://arxiv.org/abs/2003.03212
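Below is a minimal sketch of encoding both interaction types, with hypothetical module names (not the ECCV 2020 architecture): attention over neighboring agents captures agent-agent interactions, pooled map features capture agent-scene interactions, and a trajectory decoder can condition on the concatenation of both.

```python
import torch
import torch.nn as nn

class MultimodalContextEncoder(nn.Module):
    def __init__(self, agent_dim=32, scene_dim=64, out_dim=128):
        super().__init__()
        self.agent_attn = nn.MultiheadAttention(agent_dim, num_heads=4, batch_first=True)
        self.scene_pool = nn.AdaptiveAvgPool2d(1)
        self.fuse = nn.Linear(agent_dim + scene_dim, out_dim)

    def forward(self, ego_state, neighbor_states, scene_feats):
        # ego_state: (batch, 1, agent_dim); neighbor_states: (batch, N, agent_dim)
        # scene_feats: (batch, scene_dim, H, W) from a rasterized-map CNN
        agent_ctx, _ = self.agent_attn(ego_state, neighbor_states, neighbor_states)
        scene_ctx = self.scene_pool(scene_feats).flatten(1)       # (batch, scene_dim)
        ctx = torch.cat([agent_ctx.squeeze(1), scene_ctx], dim=-1)
        return self.fuse(ctx)  # context vector for the trajectory decoder
```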
Project Examples, Advice and Support
Our Latest List of Multimodal Datasets
E. Multimodal Dialog: VISDIAL (E1), Talk the Walk (E2), Vision-and-Dialog Navigation (E3), CLEVR-Dialog (E4), Fashion Retrieval (E5)
F. Event Detection: WHATS-COOKING (F1), TACOS (F2), TACOS-MULTI (F3), YOU-COOK (F4), MED (F5), TITLE-VIDEO-SUMM (F6), MEDIA-EVAL (F7), CRISSMMD (F8)
G. Cross-media Retrieval: IKEA (G1), MIRFLICKR (G2), NUS-WIDE (G3), YAHOO-FLICKR (G4), YOUTUBE-8M (G5), YOUTUBE-BOUNDING (G6), YOUTUBE-OPEN (G7), VIST (G8), Recipe1M+ (G9), VATEX (G10)
… and please let us know (via Piazza) when you find more!
More Project Examples
See the course website:
https://cmu-multicomp-lab.github.io/mmml-course/fall2020/projects/
Some Advice About Multimodal Research
Appendix: List of Multimodal Datasets
Affect recognition dataset 1 (A1)
AVEC 2015/2016
Affect recognition dataset 3 (A3)
§ http://allenai.org/plato/charades/
§ 9,848 videos of daily indoor activities
§ 267 different users
§ Recording videos at home
§ Home quality videos
Media Description – Referring Expression datasets (B6)
§ Referring Expressions:
§ Generation (Bounding Box to Text) and Comprehension (Text to Bounding Box)
§ Generate / comprehend a noun phrase which identifies a particular object in an image
§ Many datasets!
§ RefClef
§ RefCOCO (+, g)
§ GRef
Media Description - Referring Expression datasets (B7)
§ GuessWhat?!
§ Cooperative two-player guessing game for language grounding
§ Locate an unknown object in a rich image scene by asking a sequence of questions
§ 821,889 questions+answers
§ 66,537 images and 134,073 objects
Media Description - other datasets (B8)
§ Flickr30k Entities
§ Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
§ 158k captions
§ 244k coreference chains
§ 276k manually annotated bounding boxes
CSI Corpus (B9)
§ https://visualgenome.org/
MovieGraph dataset (B15)
§ http://moviegraphs.cs.toronto.edu/
Media description technical challenges
VQA Challenge 2016 and 2017 (C1)
§ Real images
§ 200k MS COCO images
§ 600k questions
§ 6M answers
§ 1.8M plausible answers
§ Abstract images
§ 50k scenes
§ 150k questions
§ 1.5M answers
§ 450k plausible answers
§ DAQUAR (C2)
§ Synthetic QA pairs based on templates
§ 12468 human question-answer pairs
§ COCO-QA (C3)
§ Object, Number, Color, Location
§ Training: 78736
§ Test: 38948
Multimodal QA – other VQA datasets (C4)
§ Visual Madlibs
§ Fill-in-the-blank Image Generation and Question Answering
§ 360,001 focused natural language descriptions for 10,738 images
§ Collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context
Multimodal QA – other VQA datasets (C5)
§ Textbook Question Answering
§ Multi-Modal Machine Comprehension
§ Context needed to answer questions is provided and composed of both text and images
§ 78338 sentences, 3455 images
§ 26260 questions
Multimodal QA – other VQA datasets (C6)
§ Visual7W
§ Grounded Question Answering in Images
§ 327,939 QA pairs on 47,300 COCO images
§ 1,311,756 multiple-choices, 561,459 object groundings, 36,579 categories
§ what, where, when, who, why, how and which
Multimodal QA – other VQA datasets (C7)
§ TVQA
§ Video QA dataset based on 6 popular TV shows
§ 152.5K QA pairs from 21.8K clips
§ Compositional questions
Multimodal QA – Visual Reasoning (C8)
§ VCR: Visual Commonsense Reasoning
§ Model must answer challenging visual questions expressed in language, and provide a rationale explaining why its answer is true
Multimodal QA – Visual Reasoning (C9)
§ Cornell NLVR
§ 92,244 pairs of natural language statements grounded in synthetic images
§ Determine whether a sentence is true or false about an image
Multimodal QA – Visual Reasoning (C10)
§ CLEVR
§ A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
§ Tests a range of different specific visual reasoning abilities
§ Training set: 70,000 images and 699,989 questions
§ Validation set: 15,000 images and 149,991 questions
§ Test set: 15,000 images and 14,988 questions
Embodied Question Answering (C11)
§ Title-based Video Summarization dataset
§ 50 videos labeled for scene importance; can be used for summarization based on the title
Event detection dataset 4 (F7)
§ MIRFLICKR-1M (G2)
§ 1M images with associated tags and captions
§ Labels of general and specific categories
§ NUS-WIDE dataset (G3)
§ 269,648 images and the associated tags from Flickr, with a total number of 5,018 unique tags
§ Yahoo Flickr Creative Commons 100M (G4)
§ Videos and images
§ Can also use image and video captioning datasets – just pose it as a retrieval task (a minimal retrieval sketch follows below)
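Below is a minimal sketch of posing a captioning dataset as cross-media retrieval, assuming image and caption embeddings produced by any (hypothetical) pair of encoders: rank candidates by cosine similarity in a shared space.

```python
import torch
import torch.nn.functional as F

def retrieve(image_embs, caption_embs, top_k=5):
    # image_embs: (num_images, d); caption_embs: (num_captions, d)
    image_embs = F.normalize(image_embs, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    sims = image_embs @ caption_embs.t()       # cosine similarities
    return sims.topk(top_k, dim=-1).indices    # top-k caption indices per image
```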
Other Multimodal Datasets (G5, G6, G7, G8, G9, G10)
§ 1) YouTube 8M (G5)
§ https://research.google.com/youtube8m/
§ 2) YouTube Bounding Boxes (G6)
§ https://research.google.com/youtube-bb/
§ 3) YouTube Open Images (G7)
§ https://research.googleblog.com/2016/09/introducing-open-images-dataset.html
§ 4) VIST (G8)
§ http://visionandlanguage.net/VIST/
§ 5) Recipe1M+ (G9)
§ http://pic2recipe.csail.mit.edu/
§ 6) VATEX (G10)
§ https://eric-xw.github.io/vatex-website/
Cross-media retrieval challenges