
CMU · 11-777 | Multimodal Machine Learning (2020)

Multimodal Machine Learning
Lecture 1.2:
Multimodal Research Tasks
Louis-Philippe Morency
Guest lecture by Paul Liang

* Original version co-developed with Tadas Baltrusaitis

Lecture Objectives

§ Understand the breadth of possible tasks for multimodal research
§ Research topics in affective computing
§ Media description and Multimodal QA
§ Multimodal navigation
§ Examples of previous course projects
§ Available multimodal datasets
Administrative Stuff
First Reading Assignment – Week 2

§ 3 paper options are available


§ Each student should pick one option!
§ Then you will create a short summary to help others
§ Discussions with your study group
§ 9-10 students in each study group
§ Discuss together the 3 papers. Ask questions!
§ But you should also try to answer the questions
§ Google Sheets were created to help balance the
papers between group members
First Reading Assignment – Week 2

Four main steps for the reading assignments


1. Monday 8pm: Official start of the assignment
2. Wednesday 8pm: Select your paper
3. Friday 8pm: Post your summary
4. Monday 8pm: End of the reading assignment

Details posted on Piazza


Lecture Highlights – Starting Next Week!

§ Students should summarize lecture highlights


§ Each lecture is split into 3 segments (~30 mins each)
§ One highlight statement for each segment
§ Highlights submitted 42 hours after the lecture
§ Lecture can be watched live or asynchronously
§ Optionally, students can ask questions

Detailed instructions were also posted on Piazza


Process for Selecting your Course Project

§ Today: Lecture describing available multimodal datasets and research topics
§ Tuesday 9/8: Let us know your dataset preferences
for the course project
§ Thursday 9/10: During the later part of the lecture,
we will have an interactive period to help with team
formation
§ Wednesday 9/16: Pre-proposals are due. You
should have selected your teammates, dataset and
task
§ Following week: meeting with TAs to discuss project
Course Project Timeline

§ Pre-proposal (Wednesday 9/16)


§ Define your dataset, research task and teammates
§ First project assignment (due Friday Oct. 9)
§ Experiment with unimodal representations
§ Study prior work on your selected research topic
§ Midterm project assignment (due Friday Nov. 12)
§ Implement and evaluate state-of-the-art model(s)
§ Discuss new multimodal model(s)
§ Final project assignment (due Friday Dec. 11)
§ Implement and evaluate new multimodal model(s)
§ Discuss results and possible future directions
Multimodal Research Tasks
Prior Research on “Multimodal”

Four eras of multimodal research


Ø The “behavioral” era (1970s until late 1980s)

Ø The “computational” era (late 1980s until 2000)

Ø The “interaction” era (2000 - 2010)

Ø The “deep learning” era (2010s until …) ← main focus of this course

[Timeline axis: 1970–2010s]
Multimodal Research Tasks

[Timeline figure, 1970–2010s]
§ Birth of “Language & Vision” research; birth of “affective computing”
§ Audio-visual speech recognition
§ Content-based video retrieval
§ Affect and emotion recognition
§ Video event recognition (TrecVid)
§ Multimodal sentiment analysis
§ Image captioning (revisited)
Multimodal Research Tasks

… and many many more! [Timeline figure, 2015–2019]
§ Image captioning (revisited)
§ Video captioning & “grounding”
§ Visual question answering
§ Video QA & referring expressions
§ Multimodal dialogue (image-based)
§ Large-scale video event retrieval (e.g., YouTube8M)
§ Language, Vision and Navigation
§ Self-driving / multimodal navigation
Real world tasks tackled by MMML

A. Affect recognition
§ Emotion
§ Personalities
§ Sentiment
B. Media description
§ Image and video captioning
C. Multimodal QA
§ Image and video QA
§ Visual reasoning
D. Multimodal Navigation
§ Language guided navigation
§ Autonomous driving
Real world tasks tackled by MMML

E. Multimodal Dialog
§ Grounded dialog
F. Event recognition
§ Action recognition
§ Segmentation
G. Multimedia information
retrieval
§ Content based/Cross-
media
Affective Computing
Common Topics in Affective Computing

§ Affective states – emotions, moods, and feelings


§ Cognitive states – thinking and information processing
§ Personality – patterns of acting, feeling, and thinking
§ Pathology – health, functioning, and disorders
§ Social processes – groups, cultures, and perception

Common topics in affective computing

§ Affective states: anger, disgust, fear, happiness, sadness, positivity, activation,
pride, desire, frustration, anxiety, contempt, shame, guilt, wonder, relaxation, pain, envy
§ Cognitive states
§ Personality
§ Pathology
§ Social processes
Common topics in affective computing

§ Affective states
§ Cognitive states: engagement, interest, attention, concentration, effort, surprise,
confusion, agreement, doubt, knowledge
§ Personality
§ Pathology
§ Social processes
Common topics in affective computing

§ Affective states
§ Cognitive states
§ Personality: outgoing, assertive, energetic, sympathetic, respectful, trusting, organized,
productive, responsible, pessimistic, anxious, moody, curious, artistic, creative, sincere, modest, fair
§ Pathology
§ Social processes
Common topics in affective computing

§ Affective states
§ Cognitive states
§ Personality
§ Pathology: depression, anxiety, trauma, addiction, schizophrenia, antagonism,
detachment, disinhibition, negative affectivity, psychoticism
§ Social processes
Common topics in affective computing

§ Affective states
§ Cognitive states
§ Personality
§ Pathology
§ Social processes: rapport, cohesion, cooperation, competition, status, conflict,
attraction, persuasion, genuineness, culture, skillfulness
11-776 Multimodal Affective Computing
Audio-visual Emotion Challenge 2011/2012

§ Part of a larger SEMAINE corpus


§ Sensitive Artificial Listener paradigm
§ Labeled for four dimensional emotions (per frame):
arousal, expectancy, power, valence
§ Has transcripts

[AVEC 2011 – The First International Audio/Visual Emotion Challenge, B. Schuller et al., 2011]
Audio-visual Emotion Challenge 2013/2014

§ Reading specific text in a subset of videos
§ Labeled for emotion per frame
(valence, arousal, dominance)
§ Performing an HCI task
§ Reading aloud a text in German
§ Responding to a number of
questions
§ 100 audio-visual sessions
§ Provide extracted audio-visual features

AVEC 2013/2014
[AVEC 2013 – The Continuous Audio/Visual Emotion and Depression Recognition Challenge, Valstar et
al. 2013]
Audio-visual Emotion Challenge 2015/2016

§ RECOLA dataset
§ Audio-Visual emotion recognition
§ Labeled for dimensional emotion per frame
(arousal, valence)
AVEC 2015
§ Includes physiological data
§ 27 participants
§ French, audio, video, ECG and EDA
§ Collaboration task in video conference
§ Broader range of emotive expressions

[Introducing the RECOLA Multimodal Corpus of Remote Collaborative and Affective Interactions, F.
Ringeval et al., 2013]
Multimodal Sentiment Analysis

§ Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos (MOSI)
§ 89 speakers with 2199 opinion
segments
§ Audio-visual data with
transcriptions
§ Labels for sentiment/opinion
§ Subjective vs objective
§ Positive vs negative
Multimodal Sentiment Analysis

§ Multimodal sentiment and emotion recognition


§ CMU-MOSEI : 23,453 annotated video segments from 1,000
distinct speakers and 250 topics
Multi-Party Emotion Recognition

§ MELD: Multi-party dataset for emotion recognition in conversations
What are the Core Challenges Most Involved in
Affect Recognition?

[Figure: recurring course illustration of the core multimodal challenges – a “big dog on the beach” caption aligned with video frames at time steps t1, t2, t3, leading to a prediction]
Project Example: Select-Additive Learning
Research task: Multimodal sentiment analysis
Datasets: MOSI, YouTube, MOUD

Main idea: Reducing the effect of confounding factors when limited dataset size

[Figure: toy data examples] What rules can you infer from this data? Beware the confounding factor!

Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency and Eric P. Xing, Select-additive Learning: Improving
Generalization In Multimodal Sentiment Analysis, ICME 2017, https://arxiv.org/abs/1609.05244
Project Example: Select-Additive Learning

Solution: Learning representations that reduce the effect of user identity


[Figure: “conventional” representation learning vs. Select-Additive Learning]

Hypothesis: the representation is a mixture of the person-independent factor g(X) and the person-dependent factor h(Z).

Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency and Eric P. Xing, Select-additive Learning: Improving
Generalization In Multimodal Sentiment Analysis, ICME 2017, https://arxiv.org/abs/1609.05244
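To make the decomposition concrete, here is a simplified PyTorch sketch of the select-additive idea (module and variable names are illustrative, not the authors' released code): the learned representation f(X) is split into an identity-predictable part h(Z) and a residual g(X), and noise on the identity part during training discourages the predictor from relying on who is speaking.

```python
import torch
import torch.nn as nn

class SelectAdditive(nn.Module):
    """Simplified sketch of select-additive learning (illustrative names/dims).
    The representation f(x) is assumed to mix a person-independent factor g(x)
    with a person-dependent factor h(z), where z is a one-hot speaker identity."""
    def __init__(self, feat_dim, num_speakers, hidden=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())  # full representation f(x)
        self.h = nn.Linear(num_speakers, hidden)                        # identity-predictable part h(z)
        self.clf = nn.Linear(hidden, 1)                                 # sentiment predictor

    def forward(self, x, z):
        fx = self.f(x)
        hz = self.h(z)
        gx = fx - hz                      # "select": keep the person-independent residual
        if self.training:                 # "additive": noise on the identity part so the
            eps = torch.randn_like(hz)    # classifier cannot exploit speaker identity
            return self.clf(gx + eps * hz)
        return self.clf(gx)
```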
Project Example: Word-Level Gated Fusion

Research task: Multimodal sentiment analysis


Datasets: MOSI, YouTube, MOUD

Main idea: Estimating importance of each modality at the word-level in a video.

How can we build an interpretable model that estimates modality and temporal importance, and learns to attend to important information?

Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, Louis-Philippe Morency, Multimodal Sentiment
Analysis with Word-Level Fusion and Reinforcement Learning, ICMI 2017, https://arxiv.org/abs/1802.00924
Project Example: Word-Level Gated Fusion
Solution:
- Word-level alignment
- Temporal attention over words
- Gated attention over modalities

Hypothesis: attention weights represent the contribution of each modality at each time step.

Modality gates determine the importance and contribution of each modality – trained with reinforcement learning.

Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, Louis-Philippe Morency, Multimodal Sentiment
Analysis with Word-Level Fusion and Reinforcement Learning, ICMI 2017, https://arxiv.org/abs/1802.00924
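A minimal sketch of the word-level gated fusion idea (dimensions and module names are assumptions; the actual model also performs word-level alignment and trains the gates with reinforcement learning, which is omitted here): per-word sigmoid gates weight each modality before fusion, so the gate values can be read off as modality importance at every time step.

```python
import torch
import torch.nn as nn

class WordLevelGatedFusion(nn.Module):
    def __init__(self, d_lang, d_audio, d_visual, d_model=128):
        super().__init__()
        dims = {"lang": d_lang, "audio": d_audio, "visual": d_visual}
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        self.gate = nn.ModuleDict({m: nn.Linear(sum(dims.values()), 1) for m in dims})
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, 1)

    def forward(self, lang, audio, visual):
        # inputs: (batch, num_words, d_*), already aligned at the word level
        concat = torch.cat([lang, audio, visual], dim=-1)
        fused, gates = 0, {}
        for name, x in [("lang", lang), ("audio", audio), ("visual", visual)]:
            g = torch.sigmoid(self.gate[name](concat))   # (batch, num_words, 1) gate per modality
            gates[name] = g
            fused = fused + g * self.proj[name](x)
        h, _ = self.rnn(fused)
        return self.out(h[:, -1]), gates                 # sentiment score + interpretable gates
```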
Media Description
Media description

§ Given a media item (image, video, or audio-visual clip), provide a free-form text description
Large-Scale Image Captioning Dataset
§ Microsoft Common Objects in COntext (MS COCO)
§ 120,000 images
§ Each image is accompanied by five free-form sentences
describing it (at least 8 words)
§ Sentences collected using crowdsourcing (Mechanical Turk)
§ Also contains object detections, boundaries and keypoints
Evaluating Image Caption Generations

§ Has an evaluation server


§ Training and validation - 80K images (400K
captions)
§ Testing – 40K images (380K captions); a subset contains more captions for better
evaluation, and these are kept private (to avoid over-fitting and cheating)
§ Evaluation is difficult as there is no one “correct” answer
for describing an image in a sentence
§ Given a candidate sentence it is evaluated against a set
of “ground truth” sentences
Evaluating Image Captioning Results

§ A challenge was done with actual human


evaluations of the captions (CVPR 2015)
Evaluating Image Captioning Results

§ What about automatic evaluation?


§ Human labels are expensive…
§ Have automatic ways to evaluate
§ CIDEr-D, Meteor, ROUGE, BLEU
§ How do they compare to human evaluations?
§ Not so well …
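As a toy illustration of how reference-based caption metrics are computed, the snippet below scores a candidate caption against multiple ground-truth sentences with NLTK's BLEU (the COCO server uses its own implementations of BLEU, METEOR, ROUGE-L and CIDEr-D, which are not shown here):

```python
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a large brown dog runs along the beach".split(),
    "a big dog is running on the sand".split(),
]
candidate = "a big dog on the beach".split()

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```

Scoring against several references softens, but does not solve, the problem that there is no single “correct” description of an image.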
Video captioning

Based on audio descriptions for the blind (Descriptive Video Service – DVS)

• Alignment is a challenge since the description can happen after the video segment
• Only a single caption per clip – a challenge for evaluation
Video Description and Alignment

Let’s ask MTurk users to “act” the description!

Charades Dataset: http://allenai.org/plato/charades/

The first author was a student in the first edition of the MMML course!


How to Address the Challenge of Evaluation?

Referring Expressions: Generate / Comprehend a noun phrase which identifies a particular object in an image

This is related to “grounding”, which links linguistic elements to the shared environment (in this case, an image)
Large-Scale Description and Grounding Dataset
Visual Genome Dataset

https://visualgenome.org/
What are the Core Challenges Most Involved in
Media Description?

[Figure: same core-challenges illustration (“big dog on the beach”, time steps t1–t3, prediction)]
Multimodal QA
Visual Question Answering

§ Task – Given an image and a question, answer the question (http://www.visualqa.org/)
Multimodal QA dataset 1 – VQA (C1)

§ Real images
§ 200k MS COCO images
§ 600k questions
§ 6M answers
§ 1.8M plausible answers
§ Abstract images
§ 50k scenes
§ 150k questions
§ 1.5M answers
§ 450k plausible answers
VQA Challenge 2016 and 2017 (C1)

§ Two challenges organized these past two years (link)

§ Currently good at yes/no questions, not so much at free-form answers and counting
VQA 2.0

§ Just guessing without an image leads to ~51% accuracy

§ So the V in VQA “only” adds a 14% increase in accuracy
§ VQA v2.0 is attempting to address this
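For context, here is a sketch of the consensus-based accuracy used on visualqa.org: a predicted answer counts as fully correct if at least 3 of the 10 human annotators gave it. The official script additionally averages over all 9-annotator subsets and normalizes answer strings, which is omitted in this simplified version.

```python
from collections import Counter

def vqa_accuracy(predicted, human_answers):
    """min(#humans who gave this answer / 3, 1), without the official
    subset-averaging and answer-string normalization."""
    counts = Counter(a.lower().strip() for a in human_answers)
    return min(counts[predicted.lower().strip()] / 3.0, 1.0)

print(vqa_accuracy("yes", ["yes"] * 7 + ["no"] * 3))   # 1.0
print(vqa_accuracy("no",  ["yes"] * 7 + ["no"] * 3))   # 1.0
print(vqa_accuracy("2",   ["2", "2"] + ["3"] * 8))     # ~0.67
```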
Multimodal QA – other VQA datasets
Multimodal QA – other VQA datasets (C7)
§ TVQA
§ Video QA dataset based on 6 popular TV shows
§ 152.5K QA pairs from 21.8K clips
§ Compositional questions
Multimodal QA – Visual Reasoning (C8)
§ VCR: Visual Commonsense Reasoning
§ Model must answer challenging visual questions expressed in
language
§ And provide a rationale explaining why its answer is true.
Social-IQ (A10)

§ Social-IQ: 1.2k videos, 7.5k questions, 50k answers


§ Questions and answers centered around social behaviors
What are the Core Challenges Most Involved in
Multimodal QA?

[Figure: same core-challenges illustration (“big dog on the beach”, time steps t1–t3, prediction)]
Project Example: Adversarial Attacks on VQA models

Research task: Adversarial Attacks on VQA models


Datasets: VQA
Main idea: Test the robustness of VQA models to adversarial attacks on the
image.
[Figure: image + adversarial perturbation → VQA model; Q: “what kind of flowers are in the vase?” – the answer flips from “roses” to “sunflower”]

How can we design a targeted attack on images in VQA models, which will
help in assessing robustness of existing models?

Vasu Sharma, Ankita Kalra, Vaibhav, Simral Chaudhary, Labhesh Patel, Louis-Philippe Morency, Attend and Attack:
Attention Guided Adversarial Attacks on Visual Question Answering Models. NeurIPS ViGIL workshop 2018.
https://nips2018vigil.github.io/static/papers/accepted/33.pdf
Project Example: Adversarial Attacks on VQA models

Solution: Use fusion over the original image and question to generate an adversarial perturbation map over the image

Hypothesis: the question helps to localize important visual regions for targeted adversarial attacks.

[Figure: adversarial perturbation map]

Vasu Sharma, Ankita Kalra, Vaibhav, Simral Chaudhary, Labhesh Patel, Louis-Philippe Morency, Attend and Attack:
Attention Guided Adversarial Attacks on Visual Question Answering Models. NeurIPS ViGIL workshop 2018.
https://nips2018vigil.github.io/static/papers/accepted/33.pdf
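For intuition, a generic targeted FGSM-style baseline is sketched below. This is not the paper's attention-guided attack, and the model interface `vqa_model(image, question) -> answer logits` is an assumption: the image is nudged along the negative gradient of the loss with respect to a chosen target answer.

```python
import torch
import torch.nn.functional as F

def targeted_fgsm(vqa_model, image, question, target_answer_idx, eps=0.01):
    # image: (batch, 3, H, W) in [0, 1]; question: whatever encoding the model expects
    image = image.clone().detach().requires_grad_(True)
    logits = vqa_model(image, question)
    target = torch.full((logits.size(0),), target_answer_idx,
                        dtype=torch.long, device=logits.device)
    loss = F.cross_entropy(logits, target)
    loss.backward()
    # step *down* the loss gradient so the target answer becomes more likely
    return (image - eps * image.grad.sign()).clamp(0, 1).detach()
```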
Multimodal Navigation
Embedded Assistive Agents

The next generation of AI assistants needs to interact with the real (or virtual?) world.
Language, Vision and Actions

User: Go to the entrance of the lounge area.

Robot: Sure. I think I’m there. What else?

User: On your right there will be a bar. On top of the counter, you will see a box. Bring me that.
Many Technical Challenges

Instruction:
Find the window. Look left at the cribs. Search for the tricolor crib. The target is below that crib.

Instruction generation
Instruction following

Linking Action-Language-Vision
[Figure: actions taken between viewpoints 0 through 3]
Navigating in a Virtual House

Visually-grounded natural language navigation in real buildings


§ Room-2-Room: 21,567 open vocabulary, crowd-sourced
navigation instructions
Multiple Step Instructions

Refer360 Dataset

Step 1: place the door leading outside to center.
Step 2: notice the silver and black coffee pot closest to you on the bar. See the black trash bin on the floor in front of the coffee pot.
Step 3: Waldo is on the face of the trash bin, about 1 foot off the floor and also slightly on the brown wood.
Language meets Games

Interactive game playing RL agents with language input

Heinrich Küttler, Nantas Nardelli, Alexander H. Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, Tim Rocktäschel. The NetHack Learning Environment. https://arxiv.org/abs/2006.13760
Language meets Games

Agents who must speak and act in a game

Shrimai Prabhumoye, Margaret Li, Jack Urbanek, Emily Dinan, Douwe Kiela, Jason Weston, Arthur Szlam. I love your chain mail!
Making knights smile in a fantasy game world: Open-domain goal-oriented dialogue agents. https://arxiv.org/abs/2002.02878
What are the Core Challenges Most Involved in
Multimodal Navigation?

[Figure: same core-challenges illustration (“big dog on the beach”, time steps t1–t3, prediction)]
Project Example: Instruction Following

Research task: Task-Oriented Language Grounding in an Environment


Datasets: ViZDoom, based on the Doom video game
Main idea: Build a model that comprehends natural language instructions, grounds
the entities and relations to the environment, and executes the instruction.

Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, Ruslan Salakhutdinov,
Gated-Attention Architectures for Task-Oriented Language Grounding. AAAI 2018 https://arxiv.org/abs/1706.07230
Project Example: Instruction Following

Solution: Gated attention architecture to attend to instruction and states

Hypothesis: Gated attention learns to ground and compose attributes in


natural language with the image features. e.g. learning grounded
representations for ‘green’ and ‘torch’.

Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, Ruslan Salakhutdinov,
Gated-Attention Architectures for Task-Oriented Language Grounding. AAAI 2018 https://arxiv.org/abs/1706.07230
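The gating operation itself is compact; below is a sketch under assumed tensor shapes (the full agent also has a convolutional image encoder, a GRU instruction encoder and a policy network, which are omitted):

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Hadamard-product gating of visual feature maps by the instruction:
    one sigmoid gate per channel, broadcast over spatial locations."""
    def __init__(self, instr_dim, num_channels):
        super().__init__()
        self.to_gates = nn.Linear(instr_dim, num_channels)

    def forward(self, conv_feats, instr_emb):
        # conv_feats: (batch, C, H, W); instr_emb: (batch, instr_dim)
        gates = torch.sigmoid(self.to_gates(instr_emb))   # (batch, C)
        return conv_feats * gates.unsqueeze(-1).unsqueeze(-1)
```

Intuitively, an instruction mentioning “green” and “torch” can switch on the feature channels that respond to those attributes.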
Project Example: Multiagent Trajectory Forecasting

Research task: Multiagent trajectory forecasting for autonomous driving


Datasets: Argoverse and Nuscenes autonomous driving datasets
Main idea: Build a model that understands the environment and multiagent
trajectories and predicts a set of multimodal future trajectories for each agent.

Seong Hyeon Park, Gyubok Lee, Manoj Bhat, Jimin Seo, Minseok Kang, Jonathan Francis, Ashwin R. Jadhav, Paul Pu Liang,
Louis-Philippe Morency, Diverse and Admissible Trajectory Forecasting through Multimodal Context Understanding. ECCV 2020
https://arxiv.org/abs/1706.07230
Project Example: Multiagent Trajectory Forecasting

Solution: Modeling the environment and multiple agents to learn a distribution of future trajectories for each agent.

Hypothesis: both agent-agent interactions and agent-scene interactions are important!

Seong Hyeon Park, Gyubok Lee, Manoj Bhat, Jimin Seo, Minseok Kang, Jonathan Francis, Ashwin R. Jadhav, Paul Pu Liang,
Louis-Philippe Morency, Diverse and Admissible Trajectory Forecasting through Multimodal Context Understanding. ECCV 2020
https://arxiv.org/abs/1706.07230
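To show what “a set of multimodal future trajectories” means in code, here is a toy multi-hypothesis forecaster (illustrative only; the actual ECCV 2020 model also encodes the scene and agent-agent interactions): the decoder emits K candidate futures per agent instead of a single trajectory.

```python
import torch
import torch.nn as nn

class MultiHypothesisForecaster(nn.Module):
    def __init__(self, hidden=64, horizon=30, num_modes=6):
        super().__init__()
        self.horizon, self.num_modes = horizon, num_modes
        self.encoder = nn.GRU(2, hidden, batch_first=True)         # encode past (x, y) positions
        self.decoder = nn.Linear(hidden, num_modes * horizon * 2)  # emit K candidate futures

    def forward(self, past_xy):
        # past_xy: (batch, past_len, 2)
        _, h = self.encoder(past_xy)
        futures = self.decoder(h[-1])
        return futures.view(-1, self.num_modes, self.horizon, 2)   # (batch, K, horizon, 2)
```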
Project Examples, Advice and Support
Our Latest List of Multimodal Datasets

A. Affect Recognition: AFEW (A1), AVEC (A2), IEMOCAP (A3), POM (A4), MOSI (A5),
CMU-MOSEI (A6), TUMBLR (A7), AMHUSE (A8), VGD (A9), Social-IQ (A10), MELD (A11),
MUStARD (A12), DEAP (A14), MAHNOB (A15), Continuous LIRIS-ACCEDE (A16), DECAF (A17),
ASCERTAIN (A18), AMIGOS (A19)

B. Media Description: MSCOCO (B1), MPII (B2), MONTREAL (B3), LSMDC (B4), CHARADES (B5),
REFEXP (B6), GUESSWHAT (B7), FLICKR30K (B8), CSI (B9), MVSO (B10), NeuralWalker (B11),
Visual Relation (B12), Visual Genome (B13), Pinterest (B14), Movie Graph (B15),
Nocaps (B16), CrossTask (B17), Refer360 (B18)
Our Latest List of Multimodal Datasets

C. Multimodal QA: VQA (C1), DAQUAR (C2), COCO-QA (C3), MADLIBS (C4), TEXTBOOK (C5),
VISUAL7W (C6), TVQA (C7), VCR (C8), Cornell NLVR (C9), CLEVR (C10), EQA (C11),
TextVQA (C12), GQA (C13), CompGuessWhat (C14)

D. Multimodal Navigation: Room-2-Room (D1), RERERE (D2), VNLA (D3), nuScenes (D4),
Waymo (D5), CARLA (D6), Argoverse (D7), ALFRED (D8)
Our Latest List of Multimodal Datasets

E. Multimodal Dialog: VISDIAL (E1), Talk the Walk (E2), Vision-and-Dialog Navigation (E3),
CLEVR-Dialog (E4), Fashion Retrieval (E5)

F. Event Detection: WHATS-COOKING (F1), TACOS (F2), TACOS-MULTI (F3), YOU-COOK (F4),
MED (F5), TITLE-VIDEO-SUMM (F6), MEDIA-EVAL (F7), CRISISMMD (F8)

G. Cross-media Retrieval: IKEA (G1), MIRFLICKR (G2), NUS-WIDE (G3), YAHOO-FLICKR (G4),
YOUTUBE-8M (G5), YOUTUBE-BOUNDING (G6), YOUTUBE-OPEN (G7), VIST (G8), Recipe1M+ (G9),
VATEX (G10)

… and please let us know (via Piazza) when you find more!
More Project Examples
See the course website:
https://cmu-multicomp-lab.github.io/mmml-course/fall2020/projects/
Some Advice About Multimodal Research

§ Think more about the research problems, and less about the datasets themselves
§ Aim for generalizable models across several datasets
§ Aim for models inspired by existing research, e.g., psychology
§ Some areas to consider beyond performance:
§ Robustness to missing/noisy modalities, adversarial attacks
§ Studying social biases and creating fairer models
§ Interpretable models
§ Faster models for training/storage/inference
§ Theoretical projects are welcome too – make sure there
are also experiments to validate theory
Some Advice About Multimodal Datasets

§ If you are used to dealing with text or speech:
§ Space will become an issue working with image/video data
§ Some datasets are in 100s of GB (compressed)
§ Memory for processing it will become an issue as well
§ Won’t be able to store it all in memory
§ Time to extract features and train algorithms will also
become an issue
§ Plan accordingly!
§ Sometimes tricky to experiment on a laptop (might need to do
it on a subset of data)
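One common workaround (a minimal pattern, with hypothetical file names and keys) is to pre-extract features once, keep only an index of file paths in memory, and load clips lazily, optionally on a subset of the data:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PrecomputedFeatureDataset(Dataset):
    """Loads per-clip feature files on demand instead of holding the corpus in RAM."""
    def __init__(self, index_file, subset_fraction=1.0):
        with open(index_file) as f:
            self.paths = [line.strip() for line in f]
        self.paths = self.paths[: int(len(self.paths) * subset_fraction)]  # debug on a subset

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        item = torch.load(self.paths[i])   # e.g. {"visual": ..., "audio": ..., "label": ...}
        return item["visual"], item["audio"], item["label"]

# loader = DataLoader(PrecomputedFeatureDataset("train_index.txt", subset_fraction=0.1),
#                     batch_size=32, num_workers=4)
```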
Available Tools

§ Use available tools in your research groups
§ Or pair up with someone who has access to them
§ Find some GPUs!
§ We will be getting AWS credit for some extra
computational power
§ Google Cloud Platform credit as well
Upcoming Course Assignments

Project preferences (deadline Tuesday 9/8 at 8pm ET)


§ Let us know about your project preferences, including datasets,
research topics and potential teammates
§ See instructions on Piazza
§ We will reserve a moment for discussions on Thursday 9/10 to help
you with finding project teammates
Reading Assignment (Summaries due Friday 9/11 at 8pm ET)
§ We created the study groups in Piazza.
§ End of the discussion period: Monday 9/14 at 8pm ET
Lecture Highlights (for both lectures next week)
§ Starting next week, you need to post your lecture highlights
following each course lecture. See Piazza for detailed instructions.
END of Today’s Lecture
Appendix: List of Multimodal Datasets
Affect recognition dataset 1 (A1)

§ AFEW – Acted Facial Expressions in the


Wild (part of EmotiW Challenge)
§ Audio-Visual emotion labels – acted
emotion clips from movies
§ 1400 video sequences of about 330
subjects
§ Labelled for six basic emotions + neutral
§ Movies are known, can extract the
subtitles/script of the scenes
§ Part of EmotiW challenge
Affect recognition dataset 2 (A2)

§ The AVEC challenge datasets: 2011/2012, 2013/2014, 2015, 2016, 2017, 2018
§ Audio-Visual emotion recognition
§ Labeled for dimensional emotion (per frame)
§ 2011/2012 has transcripts
§ 2013/2014/2016 also include depression labels per subject
§ 2013/2014: reading specific text in a subset of videos
§ 2015/2016 include physiological data
§ 2017/2018 include depression/bipolar labels

[Figures: AVEC 2011/2012, 2013/2014, 2015/2016 screenshots]
Affect recognition dataset 3 (A3)

§ The Interactive Emotional Dyadic Motion


Capture (IEMOCAP)
§ 12 hours of data, but only 10 participants
§ Video, speech, motion capture of face, text
transcriptions
§ Dyadic sessions where actors perform
improvisations or scripted scenarios
§ Categorical labels (6 basic emotions plus
excitement, frustration) as well as dimensional
labels (valence, activation and dominance)
§ Focus is on speech
Affect recognition dataset 4 (A4)

§ Persuasive Opinion Multimedia (POM)


§ 1,000 online movie review videos
§ A number of speaker traits/attributes
labeled – confidence, credibility, passion,
persuasion, big 5…
§ Video, audio and text
§ Good quality audio and video recordings
Affect recognition dataset 5 (A5)

§ Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos (MOSI)
§ 89 speakers with 2199 opinion
segments
§ Audio-visual data with
transcriptions
§ Labels for sentiment/opinion
§ Subjective vs objective
§ Positive vs negative
Affect Recognition: CMU-MOSEI (A6)

§ Multimodal sentiment and emotion recognition


§ CMU-MOSEI : 23,453 annotated video segments from 1,000
distinct speakers and 250 topics
Tumblr Dataset: Sentiment and Emotion Analysis (A7)

§ Tumblr Dataset – Tumblr posts with images and emotion word tags.
§ 256,897 posts with images.
§ Labels obtained from 15
categories of emotion word
tags.
§ Dataset not directly available
but code for collecting the
dataset is provided.
AMHUSE Dataset: Multimodal Humor Sensing (A8)

§ AMHUSE – Multimodal humor sensing.


§ Include various modalities:
§ Video from RGB-d camera, but no audio/language
§ Sensory data: blood volume pulse, electrodermal activity, etc.
§ Time series of 36 recipients during 4 different stimuli.
§ Continuous annotations of arousal and dominance throughout each time series.
Case-level annotation of level of pleasure is also available.
Video Game Dataset: Multimodal Game Rating (A9)

§ VGD – Video Game Dataset: game rating based on text and trailer screenshots.
§ 1,950 game trailers.
§ Labelled for score ranges of
the game, based on online
critics.
Social-IQ (A10)

§ Social-IQ: 1.2k videos, 7.5k questions, 50k answers


§ Questions and answers centered around social behaviors
MELD (A11)

§ MELD: Multi-party dataset for emotion recognition in conversations
MUStARD (A12)

§ MUStARD: Multimodal sarcasm dataset


More affect recognition datasets (A13-A19)
§ DEAP (A13)
§ Emotion analysis using EEG, physiological, and video signals
§ MAHNOB (A14)
§ Laughter database
§ Continuous LIRIS-ACCEDE (A15)
§ Induced valence and arousal self-assessments for 30 movies
§ DECAF (A16)
§ MEG + near-infra-red facial videos + ECG + … signals
§ ASCERTAIN (A17)
§ Personality and affect recognition from physiological sensors
§ AMIGOS (A18)
§ Affect, personality, and mood from neuro-physiological signals
§ EMOTIC (A19)
§ Context Based Emotion Recognition
Media description dataset 1 – MS COCO (B1)
§ Microsoft Common Objects in COntext (MS COCO)
§ 120,000 images
§ Each image is accompanied by five free-form sentences
describing it (at least 8 words)
§ Sentences collected using crowdsourcing (Mechanical Turk)
§ Also contains object detections, boundaries and keypoints
Media description dataset 2 - Video captioning (B2&B3)

§ MPII Movie Description dataset (B2)


§ A Dataset for Movie Description

§ Montréal Video Annotation dataset (B3)


§ Using Descriptive Video Services to Create a Large Data Source for Video
Annotation Research
Media description dataset 2 - Video captioning (B2&B3)

§ Both based on audio descriptions for the blind (Descriptive Video Service – DVS tracks)
§ MPII – 70k clips (~4s) with
corresponding sentences from 94
movies
§ Montréal – 50k clips (~6s) with
corresponding sentences from 92
movies
§ Not always well aligned
§ Quite noisy labels
§ Single caption per clip
Media description dataset 2 - Video captioning (B4)

§ Large Scale Movie Description and Understanding Challenge (LSMDC), hosted at ECCV 2016 and ICCV 2015
§ Combines both of the datasets and provides three challenges
§ Movie description
§ Movie annotation and Retrieval
§ Movie Fill-in-the-blank
§ Nice challenge, but beware
§ Need a lot of computational power
§ Processing will take space and time
Charades Dataset – video description dataset (B5)

§ http://allenai.org/plato/charades/
§ 9848 videos of daily indoors activities
§ 267 different users
§ Recording videos at home
§ Home quality videos
Media Description – Referring Expression datasets (B6)

§ Referring Expressions:
§ Generation (Bounding Box to Text) and Comprehension (Text to
Bounding Box)
§ Generate / Comprehend a noun phrase which identifies a particular
object in an image
§ Many datasets!
§ RefClef
§ RefCOCO (+, g)
§ GRef
Media Description - Referring Expression datasets (B7)

§ GuessWhat?!
§ Cooperative two-player
guessing game for language
grounding
§ Locate an unknown object in a
rich image scene by asking a
sequence of questions
§ 821,889 questions+answers
§ 66,537 images and 134,073
objects
Media Description - other datasets (B8)
§ Flickr30k Entities
§ Region-to-Phrase Correspondences for Richer Image-to-Sentence
Models
§ 158k captions
§ 244k coreference chains
§ 276k manually annotated bounding boxes
CSI Corpus (B9)

§ CSI-Corpus: 39 videos from the U.S. TV show “Crime Scene Investigation Las Vegas”
§ Data: sequences of inputs comprising information from different modalities such as
text, video, or audio. The task is to predict, for each input, whether the perpetrator
is mentioned or not.
Other Media Description Datasets (B10-B18)
§ MVSO (B10): Multilingual Visual Sentiment Ontology. There are
multiple derivatives of this as well
§ NeuralWalker (B11): 'Listen, Attend, and Walk: Neural Mapping of
Navigational Instructions to Action Sequences’
§ Visual Relation dataset (B12): learning relations between objects
based on language priors.
§ Visual genome (B13) Great resource for many multimodal
problems.
§ Pinterest (B14): Contains 300 million sentences describing over 40
million 'pins'
§ nocaps (B16): novel object captioning at scale
§ CrossTask (B17): procedure annotations in videos
§ Refer360° (B18): Referring Expression Recognition in 360° Images
Visual Genome (B13)

§ https://visualgenome.org/
MovieGraph dataset (B15)

§ http://moviegraphs.cs.toronto.edu/
Media description technical challenges

§ What technical problems could be addressed?


§ Translation
§ Representation
§ Alignment
§ Co-training/transfer learning
§ Fusion
Multimodal QA dataset 1 – VQA (C1)

§ Task – Given an image and a question, answer the question (http://www.visualqa.org/)
Multimodal QA dataset 1 – VQA (C1)

§ Real images
§ 200k MS COCO images
§ 600k questions
§ 6M answers
§ 1.8M plausible answers
§ Abstract images
§ 50k scenes
§ 150k questions
§ 1.5M answers
§ 450k plausible answers
VQA Challenge 2016 and 2017 (C1)

§ Two challenges organized these past two years (link)

§ Currently good at yes/no questions, not so much at free-form answers and counting
VQA 2.0

§ Just guessing without an image leads to ~51% accuracy

§ So the V in VQA “only” adds a 14% increase in accuracy
§ VQA v2.0 is attempting to address this
Multimodal QA – other VQA datasets
Multimodal QA – other VQA datasets (C2&C3)

§ DAQUAR (C2)
§ Synthetic QA pairs based on templates
§ 12468 human question-answer pairs

§ COCO-QA (C3)
§ Object, Number, Color, Location
§ Training: 78736
§ Test: 38948
Multimodal QA – other VQA datasets (C4)

§ Visual Madlibs
§ Fill in the blank Image Generation
and Question Answering
§ 360,001 focused natural language
descriptions for 10,738 images
§ collected using automatically
produced fill-in-the-blank
templates designed to gather
targeted descriptions about:
people and objects, their
appearances, activities, and
interactions, as well as inferences
about the general scene or its
broader context
Multimodal QA – other VQA datasets (C5)
§ Textbook Question Answering
§ Multi-Modal Machine Comprehension
§ Context needed to answer questions provided and composed of both
text and images
§ 78338 sentences, 3455 images
§ 26260 questions
Multimodal QA – other VQA datasets (C6)
§ Visual7W
§ Grounded Question Answering in Images
§ 327,939 QA pairs on 47,300 COCO images
§ 1,311,756 multiple-choices, 561,459 object groundings,
36,579 categories
§ what, where, when, who, why, how and which
Multimodal QA – other VQA datasets (C7)
§ TVQA
§ Video QA dataset based on 6 popular TV shows
§ 152.5K QA pairs from 21.8K clips
§ Compositional questions
Multimodal QA – Visual Reasoning (C8)
§ VCR: Visual Commonsense Reasoning
§ Model must answer challenging visual questions expressed in
language
§ And provide a rationale explaining why its answer is true.
Multimodal QA – Visual Reasoning (C9)
§ Cornell NLVR
§ 92,244 pairs of natural language statements grounded in
synthetic images
§ Determine whether a sentence is true or false about an image
Multimodal QA – Visual Reasoning (C10)

§ CLEVR
§ A Diagnostic Dataset
for Compositional Language
and Elementary Visual
Reasoning
§ Tests a range of different
specific visual reasoning
abilities
§ Training set: 70,000 images
and 699,989 questions
§ Validation set: 15,000 images
and 149,991 questions
§ Test set: 15,000 images and
14,988 questions
Embodied Question Answering (C11)

§ An agent is spawned at a random location in a 3D environment and asked a question
§ EQA v1.0: 9,000 questions from 774 environments
TextVQA (C12), GQA (C13), CompGuessWhat (C14)

§ TextVQA requires models to read and reason about text in images to answer
questions about them. Specifically, models need to incorporate a new modality of
text present in the images and reason over it to answer TextVQA questions.

§ GQA: Real-World Visual Reasoning and Compositional Question Answering. A new
dataset for real-world visual reasoning and compositional question answering,
seeking to address key shortcomings of previous VQA datasets.
§ CompGuessWhat: a framework for evaluating the quality of learned neural
representations, in particular concerning attribute grounding.
Multimodal QA technical challenges

§ What technical problems could be addressed?


§ Translation
§ Representation
§ Alignment
§ Fusion
§ Co-training/transfer learning
Room-2-Room Navigation with NL instructions (D1)

§ Visually grounded natural language navigation in real buildings
§ Room-2-Room: 21,567 open-vocabulary, crowd-sourced navigation instructions
Multimodal Navigation: RERERE (D2)

§ Remote embodied referring expressions in real indoor environments
Multimodal Navigation: VNLA (D3)

§ Vision-based navigation with language-based assistance


Autonomous driving: nuScenes (D4)

§ Multimodal dataset for autonomous driving


Autonomous driving: Waymo Open Dataset (D5)

§ Autonomous vehicle dataset


§ 1000 driving segments
§ 5 cameras and 5 lidar inputs
§ Dense labels for vehicles, pedestrians, cyclists, road signs.
Autonomous driving: CARLA (D6)

§ Simulator for autonomous driving research


§ 3 sensing modalities: normal vision camera, ground-truth
depth, and ground-truth semantic segmentation
Autonomous driving: Argoverse (D7)

§ Autonomous vehicle dataset


§ 3D tracking annotations for 113 scenes and 327,793
interesting vehicle trajectories for motion forecasting
§ Input modalities: LiDAR measurements, 360° RGB video,
front-facing stereo, and 6-dof localization
ALFRED (D8)

§ ALFRED: instruction following with long trajectories and basic affordances
Multimodal Navigation technical challenges

§ What technical problems could be addressed?


§ Translation
§ Representation
§ Alignment
§ Co-training/transfer learning
§ Fusion
Multimodal Dialog: Visual Dialog (E1)

§ VisDial v0.9: a total of ∼1.2M dialog question-answer pairs (1 dialog with 10
question-answer pairs on ∼120k images from MS-COCO)
§ VisDial v1.0 has also been released recently
§ A Visual Dialog Challenge is organized at ECCV 2018
Multimodal Dialog: Talk the Walk (E2)

§ A guide and a tourist communicate via natural language to navigate the tourist to a given target location. (paper)
Cooperative Vision-and-Dialog Navigation (E3)

§ 2k embodied, human-human dialogs situated in simulated, photorealistic home environments. (code+data)
§ Agent has to navigate towards the goal
Multimodal Dialog: CLEVR-Dialog (E4)

§ Used to benchmark visual coreference resolution.


(code+data)
Multimodal Dialog: Fashion Retrieval (E5)

§ Fashion retrieval dataset


§ Dialog-based interactive image retrieval
Multimodal Dialog technical challenges

§ What technical problems could be addressed?


§ Representation
§ Alignment
§ Translation
§ Co-training/transfer learning
§ Fusion
Event detection

§ Given video/audio/text, detect predefined events or scenes
§ Segment events in a stream
§ Summarize videos
Event detection dataset 1 (F1, F2, F3 & F4)

§ What’s Cooking (F1) – cooking action dataset
§ melt butter, brush oil, etc.
§ taste, bake etc.
§ Audio-visual, ASR captions
§ 365k clips
§ Quite noisy
§ Surprisingly many cooking
datasets:
§ TACoS (F2), TACoS Multi-
Level (F3), YouCook (F4)
Event detection dataset 2 (F5)

§ Multimedia event detection


§ TrecVid Multimedia Event Detection (MED) 2010-
2015
§ One of the six TrecVid tasks
§ Audio-visual data
§ Event detection
Event detection dataset 3 (F6)

§ Title-based Video
Summarization dataset
§ 50 videos labeled for
scene importance, can be
used for summarization
based on the title
Event detection dataset 4 (F7)

§ MediaEval challenge datasets


§ Affective Impact of Movies (including Violent
Scenes Detection)
§ Synchronization of Multi-User Event Media
§ Multimodal Person Discovery in Broadcast TV
CrisisMMD: Natural Disaster Assessment (F8)

§ CrisisMMD – Multimodal Dataset for Natural Disasters
§ 16,097 Twitter posts with one or more
images
§ Annotations comprise 3 types:
§ Informative vs. Uninformative for
humanitarian aid purposes
§ Humanitarian aid categories
§ Damage Assessment
Event detection technical challenges

§ What technical problems could be addressed?


§ Fusion
§ Representation
§ Co-learning
§ Mapping
§ Alignment (after misaligning)
Cross-media retrieval

§ Given one form of media, retrieve related forms of media: given text, retrieve
images; given an image, retrieve relevant documents
§ Examples:
§ Image search
§ Similar image search
§ Additional challenges
§ Space and speed considerations
Multimodal Retrieval: IKEA Interior Design Dataset (G1)

§ Interior Design Dataset – retrieve the desired product using room photos and text queries.
§ 298 room photos, 2193 product images/descriptions.
Cross-media retrieval datasets (G2, G3, G4)

§ MIRFLICKR-1M (G2)
§ 1M images with associated tags and captions
§ Labels of general and specific categories
§ NUS-WIDE dataset (G3)
§ 269,648 images and the associated tags from Flickr, with a
total number of 5,018 unique tags;
§ Yahoo Flickr Creative Commons 100M (G4)
§ Videos and images
§ Can also use image and video captioning datasets
§ Just pose it as a retrieval task
Other Multimodal Datasets (G5, G6, G7, G8, G9, G10)

§ 1) YouTube 8M (G5)
§ https://research.google.com/youtube8m/
§ 2) YouTube Bounding Boxes (G6)
§ https://research.google.com/youtube-bb/
§ 3) YouTube Open Images (G7)
§ https://research.googleblog.com/2016/09/introducing-open-
images-dataset.html
§ 4) VIST (G8)
§ http://visionandlanguage.net/VIST/
§ 5) Recipe1M+ (G9)
§ http://pic2recipe.csail.mit.edu/
§ 6) VATEX (G10)
§ https://eric-xw.github.io/vatex-website/
Cross-media retrieval challenges

§ What technical problems could be addressed?


§ Representation
§ Translation
§ Alignment
§ Co-learning
§ Fusion