
AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition

Rameswar Panda 1,†, Chun-Fu (Richard) Chen 1,†, Quanfu Fan 1, Ximeng Sun 2, Kate Saenko 1,2, Aude Oliva 1,3, Rogerio Feris 1

1 MIT-IBM Watson AI Lab, 2 Boston University, 3 MIT
†: Equal Contribution

Abstract

Multi-modal learning, which focuses on utilizing various modalities to improve the performance of a model, is widely used in video recognition. While traditional multi-modal learning offers excellent recognition results, its computational expense limits its impact for many real-world applications. In this paper, we propose an adaptive multi-modal learning framework, called AdaMML, that selects on-the-fly the optimal modalities for each segment conditioned on the input for efficient video recognition. Specifically, given a video segment, a multi-modal policy network is used to decide what modalities should be used for processing by the recognition model, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. Extensive experiments on four challenging, diverse datasets demonstrate that our proposed adaptive approach yields a 35%-55% reduction in computation compared to the traditional baseline that simply uses all the modalities irrespective of the input, while also achieving consistent improvements in accuracy over the state-of-the-art methods. Project page: https://rpand002.github.io/adamml.html.

1. Introduction

Videos are rich in multiple modalities: RGB frames, motion (optical flow), and audio. As a result, multi-modal learning, which focuses on utilizing various modalities to improve the performance of a video recognition model, has attracted much attention in recent years. Despite encouraging progress, multi-modal learning becomes computationally impractical in real-world scenarios where the videos are untrimmed and span several minutes or even hours. Given a long video, some modalities often provide irrelevant or redundant information for the recognition of the action class. Thus, utilizing information from all the input modalities may be counterproductive, as informative modalities are often overwhelmed by uninformative ones in long videos. Furthermore, some modalities require more computation than others, and hence selecting the cheaper modality that still performs well can significantly save computation, leading to more efficient video recognition.

Let us consider the video in Figure 1, represented by eight uniformly sampled video segments. We ask: do all the segments require both the RGB and audio streams to recognize the action "Mowing the Lawn" in this video? The answer is clearly no. The lawn mower is moving with relevant audio only in the third and sixth segments, so we need both RGB and audio streams for these two video segments to improve the model's confidence in recognizing the correct action, while the rest of the segments can be processed with only one modality or even skipped (e.g., the first and last video segments) without losing any accuracy, resulting in large computational savings compared to processing all the segments using both modalities. Thus, in contrast to the commonly used one-size-fits-all scheme for multi-modal learning, we would like these decisions to be made individually per input segment, leading to different amounts of computation for different videos. Based on this intuition, we present a new perspective on efficient video recognition by adaptively selecting input modalities, on a per-segment basis, for recognizing complex actions.

Figure 1: A conceptual overview of our approach. Rather than processing both RGB and audio modalities for all the segments, our approach learns a policy to select the optimal modalities per input segment that are needed to correctly recognize an action in a given video. In the figure, the lawn mower is moving with relevant audio only in the third and sixth segments; therefore those segments could be processed using both modalities, while the rest of the segments require only one modality (e.g., only audio is relevant for the fourth segment, as the lawn mower moves outside of the camera view but its sound is still clear) or can even be skipped (e.g., both of the modalities are irrelevant in the first and the last segment), without losing any accuracy. Note that our approach can be extended to any number of modalities, as shown in the experiments.

In this paper, we propose AdaMML, a novel and differentiable approach to learn a decision policy that selects optimal modalities conditioned on the inputs for efficient video recognition. Specifically, our main idea is to learn a model (referred to as the multi-modal policy network) that outputs the posterior probabilities of all the binary decisions for using or skipping each modality on a per-segment basis. As these decision functions are discrete and non-differentiable, we rely on an efficient Gumbel-Softmax sampling approach [23] to learn the decision policy jointly with the network parameters through standard back-propagation, without resorting to complex reinforcement learning as in [60, 61]. We design the objective function to achieve both the competitive performance and the efficiency required for video recognition. We demonstrate that adaptively selecting input modalities with a lightweight policy network yields not only significant savings in computation (e.g., about 47.3% and 35.2% fewer GFLOPs compared to a weighted fusion baseline that simply uses all the modalities, on Kinetics-Sounds [2] and ActivityNet [6] respectively), but also consistent improvements in accuracy over the state-of-the-art methods.

The main contributions of our work are as follows:

• We propose a novel and differentiable approach that automatically determines what modalities to use per segment per input for efficient video recognition. This is in sharp contrast to current multi-modal learning approaches that utilize all the input modalities without considering their relevance to video recognition.

• We efficiently train the multi-modal policy network jointly with the recognition model using standard back-propagation through Gumbel-Softmax sampling.

• We conduct extensive experiments on four video benchmarks (Kinetics-Sounds [2], ActivityNet [6], FCVID [24] and Mini-Sports1M [25]) with different multi-modal learning tasks (RGB + Audio, RGB + Flow, and RGB + Flow + Audio) to demonstrate the superiority of our approach over state-of-the-art methods.

2. Related Work

Efficient Video Recognition. Video recognition has been one of the most active research areas in computer vision recently [8]. In the context of deep neural networks, it is typically performed by either 2D-CNNs [25, 51, 12, 53, 32, 63] or 3D-CNNs [48, 7, 20, 13]. While extensive studies have been conducted in the last few years, limited efforts have been made towards efficient video recognition. Specifically, methods for efficient recognition focus on either designing new lightweight architectures (e.g., Tiny Video Networks [39], channel-separated CNNs [49], and X3D [13]) or selecting salient frames/clips [61, 60, 30, 17, 57, 22, 34, 35, 37]. Our approach is most related to the latter, which focuses on conditional computation for videos and is agnostic to the network architecture used for recognizing videos. Representative methods typically use reinforcement learning (RL) policy gradients [61, 60] or audio [30, 17] to select relevant video frames. LiteEval [59] proposes a coarse-to-fine framework that uses a binary gate for selecting either coarse or fine features. Unlike existing works, our proposed approach focuses on the multi-modal nature of videos and adaptively selects the right modality per input instance for recognizing complex actions in long videos. Moreover, our framework is fully differentiable, and thus is easier to train than complex RL policy gradients [61, 60, 57].

Multi-Modal Learning. Multi-modal learning has been studied from multiple perspectives, such as two-stream networks that fuse decisions from multiple modalities for classification [41, 7, 26, 27, 3], and cross-modal learning that takes one modality as input and makes predictions on the other modality [29, 2, 62, 1, 15, 42]. Recent work in [52] addresses the problem of joint training in multi-modal networks, without deciding which modality to focus on for a given input sample as in our current approach. Our proposed AdaMML framework is also related to prior works on joint appearance and motion modeling [43, 31, 10] that focus on combining RGB and optical flow streams. Designing different fusion schemes [38] through neural architecture search [64] is another recent trend in multi-modal learning. In contrast, we propose an instance-specific general framework for automatically selecting the right modality per segment for efficient video recognition.

Adaptive Computation. Many adaptive computation methods have been recently proposed with the goal of improving computational efficiency [4, 5, 50, 54, 18, 14, 33, 34]. While BlockDrop [58] dynamically selects which layers to execute per sample during inference, GaterNet [9] proposes a gating network to learn channel-wise binary gates for the main network. Channel gating network [21] identifies regions in the features that contribute less to the classification result, and skips the computation on a subset of the input channels for these ineffective regions. SpotTune [19] learns to adaptively route information through fine-tuned or pre-trained layers for different tasks. Adaptive selection of different regions for fast object detection is presented in [36, 16]. While our approach is inspired by these methods, in this paper our goal is to adaptively select optimal modalities per input instance to improve efficiency in video recognition. To the best of our knowledge, this is the first work on data-dependent selection of different modalities for efficient video recognition.

Figure 2: Illustration of our approach. AdaMML consists of a lightweight policy network and a recognition network composed of different sub-networks that are trained jointly (via late fusion with learnable weights) for recognizing videos. The policy network decides what modalities to use on a per-segment basis to achieve the best recognition accuracy and efficiency in video recognition. In training, policies are sampled from a Gumbel-Softmax distribution, which allows us to optimize the policy network via backpropagation. During inference, an input segment is first fed into the policy network and then the selected modalities are routed to the recognition network to generate segment-level predictions. Finally, the network averages all the segment-level predictions to obtain the video-level prediction. Best viewed in color.
3. Proposed Method

Given a video V containing a sequence of segments {s_1, s_2, ..., s_T} over K input modalities {M_1, M_2, ..., M_K}, our goal is to seek an adaptive multi-modal selection policy that decides what input modalities should be used for each segment in order to improve the accuracy, while taking the computational efficiency into account for video recognition.

3.1. Approach Overview

Figure 2 illustrates an overview of our approach. Treating the task of finding an optimal multi-modal selection policy as a search problem quickly becomes intractable, as the number of potential configurations grows exponentially with the number of video segments and modalities. Instead of handcrafting the selections, we develop a policy network that contains a very lightweight joint feature extractor and an LSTM module to output a binary policy vector per segment per input, representing whether to keep or drop an input modality for efficient multi-modal learning.

During training, the policy network is jointly trained with the recognition network using Gumbel-Softmax sampling [23]. At test time, an input video segment is first fed into the policy network, whose output decides the right modalities to use for the given segment, and then the selected input modalities are routed to the corresponding sub-networks in the recognition network to generate the segment-level predictions. Finally, the network averages all the segment-level predictions to form the video-level prediction. Note that the additional computational cost incurred by the lightweight policy network (MobileNetV2 [40] in our case) is negligible in comparison to the recognition model.
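To make the test-time behavior concrete, the following is a minimal sketch of the routing-and-averaging loop just described; it is not the authors' released code, and `policy_net`, `sub_nets`, the per-modality tensors, and the equal-weight summation of sub-network outputs are illustrative assumptions (the learnable late-fusion weights of the recognition network are omitted for brevity).

```python
import torch

@torch.no_grad()
def adamml_inference(segments, policy_net, sub_nets, num_classes):
    """Sketch of the inference pipeline: decide modalities per segment, route the
    selected inputs to the corresponding sub-networks, and average the
    segment-level predictions into a video-level prediction.

    segments: list of dicts mapping modality name -> input tensor for one segment
    policy_net: callable returning ({modality: 0 or 1}, new_lstm_state)
    sub_nets: dict mapping modality name -> recognition sub-network
    """
    video_logits = torch.zeros(num_classes)
    state = None  # LSTM state carried across segments inside the policy network
    for seg in segments:
        decisions, state = policy_net(seg, state)
        seg_logits = torch.zeros(num_classes)
        for name, use_it in decisions.items():
            if use_it:  # only the selected modalities are processed
                seg_logits = seg_logits + sub_nets[name](seg[name])
        video_logits += seg_logits
    return video_logits / len(segments)  # average of segment-level predictions
```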
3.2. Learning Adaptive Multi-Modal Policy

Multi-Modal Policy Network. The policy network contains a lightweight joint feature extractor and an LSTM module for modeling the causality across different time steps in a video. Specifically, at the t-th time step, the LSTM takes in the joint feature f_t of the current video segment s_t, the previous hidden state h_{t-1}, and the previous cell output o_{t-1} to compute the current hidden state h_t and cell state o_t:

h_t, o_t = \text{LSTM}(f_t, h_{t-1}, o_{t-1}).    (1)

Given the hidden state, the policy network estimates a policy distribution for each modality and samples binary decisions u_{t,k} indicating whether to select modality k at time step t (U = {u_{t,k}}_{t \le T, k \le K}) via the Gumbel-Softmax operation described next. Given the decisions, we forward the current segment to the corresponding sub-networks to obtain a segment-level prediction, and average all segment-level predictions to generate the video-level prediction for an input video.

Training using Gumbel-Softmax Sampling. AdaMML makes decisions about skipping or using each modality per segment per input. However, the fact that the policy is discrete makes the network non-differentiable and therefore difficult to optimize with standard backpropagation. One way to solve this is to convert the optimization into a reinforcement learning problem and then derive the optimal parameters of the policy network with policy gradient methods [55, 46]. However, RL policy gradients are often complex and unwieldy to train, require techniques to reduce variance during training, and are slow to converge in many applications [58, 59, 23, 57]. As an alternative, in this paper we adopt Gumbel-Softmax sampling [23] to resolve this non-differentiability and enable direct optimization of the discrete policy in an efficient way.

The Gumbel-Softmax trick [23] is a simple and effective way to replace the original non-differentiable sample from a discrete distribution with a differentiable sample from a corresponding Gumbel-Softmax distribution. Specifically, at each time step t, we first generate the logits z_k \in R^2 (i.e., the output scores of the policy network for modality k) from the hidden state h_t with a fully-connected layer, z_k = FC_k(h_t, \theta_{FC_k}), for each modality, and then use the Gumbel-Max trick [23] to draw discrete samples from a categorical distribution as:

\hat{P}_k = \arg\max_{i \in \{0,1\}} (\log z_{i,k} + G_{i,k}), \quad k \in [1, ..., K],    (2)

where G_{i,k} = -\log(-\log U_{i,k}) is a standard Gumbel distribution, with U_{i,k} sampled i.i.d. from a uniform distribution Unif(0, 1). Due to the non-differentiability of the arg max operation in Equation 2, the Gumbel-Softmax distribution [23] is used as a continuous relaxation of arg max. Accordingly, sampling from a Gumbel-Softmax distribution allows us to backpropagate from the discrete samples to the policy network. We represent \hat{P}_k as a one-hot vector, and the one-hot coding is relaxed to a real-valued vector P_k using softmax:

P_{i,k} = \frac{\exp((\log z_{i,k} + G_{i,k})/\tau)}{\sum_{j \in \{0,1\}} \exp((\log z_{j,k} + G_{j,k})/\tau)},    (3)

where i \in \{0, 1\}, k \in [1, ..., K], and \tau is a temperature parameter that controls the discreteness of P_k: as \tau \to +\infty, P_k converges to a uniform distribution, and as \tau \to 0, P_k becomes a one-hot vector. More specifically, when \tau becomes closer to 0, the samples from the Gumbel-Softmax distribution become indistinguishable from the discrete distribution (i.e., almost the same as the one-hot vector). In summary, during the forward pass we sample the policy using Equation 2, and during the backward pass we approximate the gradient of the discrete samples by computing the gradient of the continuous softmax relaxation in Equation 3.
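A minimal PyTorch sketch of the policy module described above is shown below. It is not the authors' released implementation; the class name and interface are assumptions, the joint feature extractor is abstracted away as a precomputed 2,048-d feature, and the hard/soft straight-through behavior of Equations 2-3 is obtained through the library's gumbel_softmax helper rather than the explicit Gumbel-Max formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalPolicyNet(nn.Module):
    """Sketch of the multi-modal policy network of Section 3.2: an LSTM over the
    per-segment joint features, plus one 2-way FC head per modality whose logits
    are sampled with (straight-through) Gumbel-Softmax so the binary decision
    stays differentiable during training."""

    def __init__(self, feature_dim=2048, hidden_dim=256, num_modalities=2):
        super().__init__()
        self.lstm = nn.LSTMCell(feature_dim, hidden_dim)        # Eq. (1)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 2) for _ in range(num_modalities)])

    def forward(self, joint_feat, state=None, tau=5.0):
        # joint_feat: (B, feature_dim) joint feature f_t of the current segment.
        # state: previous (h, c) of the LSTM, or None at the first segment.
        h, c = self.lstm(joint_feat, state)
        decisions = []
        for head in self.heads:
            logits = head(h)                                     # z_k in Eq. (2)
            # Hard one-hot sample in the forward pass, softmax relaxation (Eq. 3)
            # for the backward pass.
            sample = F.gumbel_softmax(logits, tau=tau, hard=True)
            decisions.append(sample[:, 1])                       # 1 = select modality k
        return torch.stack(decisions, dim=1), (h, c)             # (B, K) decisions u_{t,k}
```

The 2,048-d joint feature and 256-d LSTM hidden size follow the implementation details reported in Section 4.1.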
3.3. Loss Function

Let \Theta = \{\theta_\Phi, \theta_{LSTM}, \theta_{FC_1}, ..., \theta_{FC_K}, \theta_{\Psi_1}, ..., \theta_{\Psi_K}\} denote the total set of trainable parameters in our framework, where \theta_\Phi and \theta_{LSTM} represent the parameters of the joint feature extractor and the LSTM used in the policy network, respectively. \theta_{FC_1}, ..., \theta_{FC_K} represent the parameters of the fully connected layers that generate policy logits from the LSTM hidden states, and \theta_{\Psi_1}, ..., \theta_{\Psi_K} represent the parameters of the K sub-networks that are jointly trained for recognizing videos. During training, we minimize the following loss, which encourages correct predictions while penalizing the selection of modalities that require more computation:

\mathbb{E}_{(V,y)\sim\mathcal{D}_{train}} \Big[ -y \log(\mathcal{P}(V; \Theta)) + \sum_{k=1}^{K} \lambda_k \mathcal{C}_k \Big], \quad \mathcal{C}_k = \begin{cases} (|U_k|_0 / C)^2 & \text{if correct} \\ \gamma & \text{otherwise} \end{cases}    (4)

where \mathcal{P}(V; \Theta) and y represent the prediction and the one-hot encoded ground-truth label of the training video sample V, and \lambda_k represents the cost associated with processing the k-th modality. U_k represents the decision policy for the k-th modality, and \mathcal{C}_k = (|U_k|_0 / C)^2 measures the fraction of segments that selected modality k out of the total C video segments when a correct prediction is produced. We penalize incorrect predictions with \gamma, which, together with \lambda_k, controls the trade-off between efficiency and accuracy. We use these parameters to vary the operating point of our model, allowing different models to be trained depending on the target budget constraint. While the first part of Equation 4 is the standard cross-entropy loss measuring classification quality, the second part drives the network to learn a policy that favors selecting the modalities that are computationally more efficient for recognizing videos (e.g., processing RGB frames requires more computation than the audio stream).
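The following is a hedged sketch of how the objective in Equation 4 could be computed in practice; it is an illustration under stated assumptions (batched tensors, illustrative variable names), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def adamml_loss(video_logits, labels, decisions, lambdas, gamma=10.0):
    """Sketch of Eq. (4).

    video_logits: (B, num_classes) video-level predictions P(V; Theta)
    labels:       (B,) ground-truth class indices
    decisions:    (B, C, K) binary selections u_{t,k} over C segments, K modalities
    lambdas:      length-K per-modality cost weights lambda_k
    gamma:        penalty used in place of the usage term for incorrect predictions
    """
    ce = F.cross_entropy(video_logits, labels, reduction='none')      # -y log P(V; Theta)
    correct = (video_logits.argmax(dim=1) == labels).float()          # (B,)
    usage = decisions.float().mean(dim=1)                             # |U_k|_0 / C, shape (B, K)
    per_modality = (usage ** 2) * correct.unsqueeze(1) \
                   + gamma * (1.0 - correct).unsqueeze(1)             # C_k in Eq. (4)
    efficiency = (per_modality * torch.as_tensor(lambdas)).sum(dim=1) # sum_k lambda_k * C_k
    return (ce + efficiency).mean()
```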
4. Experiments

In this section, we conduct extensive experiments on four standard datasets to show that AdaMML outperforms many strong baselines, including state-of-the-art methods, while significantly reducing computation, and we provide qualitative analysis to verify the effectiveness of our adaptive policy learning.

4.1. Experimental Setup

Datasets and Tasks. We evaluate the performance of our approach using four datasets, namely Kinetics-Sounds [2], ActivityNet-v1.3 [6], FCVID [24], and Mini-Sports1M [25]. Kinetics-Sounds is a subset of Kinetics [7] and consists of 22,521 videos for training and 1,532 videos for testing across 31 action classes [17]. ActivityNet contains 10,024 videos for training and 4,926 videos for validation across 200 action categories. FCVID has 45,611 videos for training and 45,612 videos for testing across 239 classes. Mini-Sports1M (assembled by [17]) is a subset of the full Sports1M dataset [25] containing 30 videos per class for training and 10 videos per class for testing, with a total of 487 action classes. We consider three groups of multi-modal learning tasks: (I) RGB + Audio, (II) RGB + Flow, and (III) RGB + Flow + Audio on different datasets. More details about the datasets can be found in the supplementary material.

Data Inputs. For each input segment, we take around 1 second of data and temporally align all the modalities. For RGB, we uniformly sample 8 frames out of 32 consecutive frames (8x224x224); for optical flow, we stack 10 interleaved horizontal and vertical optical flow frames [51]. For audio, we use a 1-channel audio-spectrogram as input [26] (256x256, corresponding to a 1.28-second audio segment). Note that since computing optical flow is very expensive, we utilize the RGB frame difference as a proxy for flow in our policy network and compute flow only when needed. For the RGB frame difference, we follow an approach similar to the one used for optical flow and use an input clip of 15x8x224x224 obtained by simply computing the frame differences. For the policy network, we further subsample the input data for the non-audio modalities, e.g., the RGB input becomes 4x160x160.
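As a small illustration of the RGB-difference proxy mentioned above, the sketch below builds a frame-difference clip from consecutive RGB frames; the exact preprocessing (frame sampling, stacking into the 15x8x224x224 layout) is an assumption and only the basic difference operation is shown.

```python
import torch

def rgb_difference_clip(frames):
    """Sketch of the cheap motion cue used as a proxy for optical flow inside the
    policy network: per-pixel differences between consecutive RGB frames.

    frames: (T, 3, H, W) consecutive RGB frames of one segment
    returns: (T-1, 3, H, W) frame-to-frame differences
    """
    return frames[1:] - frames[:-1]

# Hypothetical usage on a 16-frame 224x224 clip.
clip = torch.randn(16, 3, 224, 224)
motion_proxy = rgb_difference_clip(clip)   # (15, 3, 224, 224)
```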
Implementation Details. For the recognition network, we use a TSN-like ResNet-50 [51] for both the RGB and Flow modalities, and MobileNetV2 [40] for the audio modality. We simply apply late fusion with learnable weights over the predictions from each modality to obtain the final prediction. We use MobileNetV2 for all modalities in the policy network to extract features, and then apply two additional FC layers with dimension 2,048 to concatenate the features from all modalities into the joint feature. The hidden dimension of the LSTM is set to 256. We use K parallel FC layers on top of the LSTM outputs to generate the binary decision policy for each modality. The computational costs for processing RGB + Audio in the policy network and the recognition network are 0.76 and 14.52 GFLOPs, respectively.
Training Details. During policy learning, we observe that optimizing for both accuracy and efficiency is not effective with a randomly initialized policy. Thus, we fix the policy network and "warm up" the recognition network using the unimodality models (trained with ImageNet weights) for 5 epochs to provide a good starting point for policy learning. We then alternately train both the policy and recognition networks for 20 epochs, and finally fine-tune the recognition network with a fixed policy network for another 10 epochs. We use the same initialization and total number of training epochs for all the baselines (including our approach) for a fair comparison. We use 5 segments per video during training in all our experiments (C set to 5). We use Adam [28] for the policy network and SGD [45] for the recognition network, following [56, 44]. We set the initial temperature \tau to 5 and gradually anneal it down towards 0 during training, as in [23]. Furthermore, at test time we use the same temperature \tau that corresponded to the training epoch in the annealing schedule. The weight decay is set to 0.0001 and the momentum in SGD is 0.9. \lambda_k is set to the ratio of the computational load between modalities, and \gamma is 10. More implementation details are included in the supplementary material.
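A small sketch of the temperature schedule described above is given below; the exponential form is an assumption consistent with the annealing factor of 0.965 reported in the ablation of Section 4.3, and the exact schedule used by the authors may differ.

```python
def gumbel_temperature(epoch, tau0=5.0, anneal=0.965):
    """Sketch of an exponential annealing schedule: start at tau0 = 5 and decay
    by a fixed factor per epoch towards a small non-zero value."""
    return tau0 * (anneal ** epoch)

# e.g., temperatures at epochs 0, 10, and 20 of the joint training stage
taus = [gumbel_temperature(e) for e in (0, 10, 20)]
```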
Baselines. We compare our approach with the following baselines and existing approaches. First, we consider unimodality baselines, where we train recognition models using each modality separately. Second, we compare with a joint training baseline, denoted as "Weighted Fusion", that simply uses all the modalities (instead of selecting optimal modalities per input) via late fusion with learnable weights. This serves as a very strong baseline for classification, at the cost of heavy computation. Finally, we compare our method with existing efficient video recognition approaches, including FrameGlimpse [61], FastForward [11], AdaFrame [60], LiteEval [59] and ListenToLook [17]. We directly quote the numbers reported in published papers when possible and use the authors' provided source code for LiteEval on both the Kinetics-Sounds and Mini-Sports1M datasets.

Evaluation Metrics. We compute either video-level mAP (mean average precision) or top-1 accuracy (the average of the predictions over 10 uniformly sampled, 224x224 center-cropped segments) to measure the overall performance of the different methods. We also report the average selection rate, computed as the percentage of total segments within a modality that are selected by the policy network on the test set, to show the adaptive modality selection of our proposed approach. We measure computational cost in giga floating-point operations (GFLOPs), which is a hardware-independent metric.
4.2. Main Results

Comparison with Weighted Fusion Baseline. We first compare AdaMML with the unimodality and weighted fusion baselines on the Kinetics-Sounds and ActivityNet datasets under different task combinations (Tables 1-3). Note that our approach is not entirely focused on accuracy; in fact, our main objective is to achieve both the competitive performance and the efficiency required for video recognition. For efficient recognition, it is very challenging to achieve improvements in both accuracy and efficiency. However, as shown in Table 1, AdaMML outperforms the weighted fusion baseline while offering 47.3% and 35.2% reductions in GFLOPs on Kinetics-Sounds and ActivityNet, respectively. Interestingly, on ActivityNet, while the performance of the weighted fusion baseline is worse than the best single-stream model (i.e., RGB only), our approach outperforms the best single-stream model on both datasets by adaptively selecting input modalities that are relevant for the recognition of the action class.

Table 1: Video recognition results with RGB + Audio modalities on Kinetics-Sounds and ActivityNet. On both datasets, our proposed approach AdaMML outperforms the weighted fusion baseline while offering significant computational savings. RGB/Audio columns report selection rates (%).

                     Kinetics-Sounds                              ActivityNet
Method            Acc. (%)  RGB     Audio   GFLOPs            mAP (%)  RGB     Audio   GFLOPs
RGB               82.85     100     -       141.36            73.24    100     -       141.36
Audio             65.49     -       100     3.82              13.88    -       100     3.82
Weighted Fusion   87.86     100     100     145.17            72.88    100     100     145.17
AdaMML            88.17     46.47   94.15   76.45 (-47.3%)    73.91    76.25   56.35   94.01 (-35.2%)

Tables 2 and 3 show the results of the RGB + Flow and RGB + Flow + Audio combinations on Kinetics-Sounds. Overall, AdaMML-Flow (which uses optical flow in the policy network) outperforms the joint training baseline while offering 50.3% (304.75 vs 151.54) and 56.9% (308.56 vs 132.94) reductions in GFLOPs on the RGB + Flow and RGB + Flow + Audio combinations, respectively. AdaMML-RGBDiff (which uses RGBDiff in policy learning) achieves similar performance to AdaMML-Flow while alleviating the computational overhead of computing optical flow (for irrelevant video segments), which shows that RGBDiff is in fact a good proxy for predicting on-demand flow computation at test time. In summary, our consistent improvements in accuracy over the weighted fusion baseline, with 35%-55% computational savings, show the importance of adaptive modality selection for efficient video recognition.

Table 2: RGB + Flow on Kinetics-Sounds. AdaMML-RGBDiff obtains the best performance with more than 50% savings in GFLOPs. RGB/Flow columns report selection rates (%).

Method            Acc. (%)  RGB     Flow    GFLOPs
RGB               82.85     100     -       141.36
Flow              75.73     -       100     163.39
Weighted Fusion   83.47     100     100     304.75
AdaMML-Flow       83.82     56.04   36.39   151.54 (-50.3%)
AdaMML-RGBDiff    84.36     44.61   37.40   137.03 (-55.0%)

Table 3: RGB + Flow + Audio on Kinetics-Sounds. AdaMML-RGBDiff obtains the best accuracy of 89.06%, which is 6.21% higher than the RGB-only performance with similar GFLOPs. RGB/Flow/Audio columns report selection rates (%).

Method            Acc. (%)  RGB     Flow    Audio   GFLOPs
RGB               82.85     100     -       -       141.36
Flow              75.73     -       100     -       163.39
Audio             65.49     -       -       100     3.82
Weighted Fusion   88.25     100     100     100     308.56
AdaMML-Flow       88.54     56.13   20.31   97.49   132.94 (-56.9%)
AdaMML-RGBDiff    89.06     55.06   26.82   95.12   141.97 (-54.0%)

Comparison with State-of-the-Art Methods. Table 4 shows that AdaMML outperforms all the compared methods, achieving the best performance of 73.91% and 85.82% mAP on ActivityNet and FCVID, respectively. Our approach achieves 1.21% and 5.82% mAP improvements over LiteEval [59] with similar GFLOPs on ActivityNet and FCVID, respectively. Moreover, AdaMML (tested using 5 segments) outperforms LiteEval by 2.70% (80.0 vs 82.70) in mAP while saving 39.2% in GFLOPs (94.3 vs 57.3) on FCVID. Table 5 further shows that AdaMML significantly outperforms LiteEval by 16.15% and 2.44%, while reducing GFLOPs by 26.5% and 8.6%, on Kinetics-Sounds and Mini-Sports1M, respectively. In summary, AdaMML is clearly better than LiteEval in terms of both accuracy and computational cost on all datasets, making it suitable for efficient recognition. Note that FrameGlimpse [61], FastForward [11] and AdaFrame [60] have less computation because they require access to future frames, unlike LiteEval and AdaMML, which make decisions based on the current time stamp only.

Table 4: Comparison with state-of-the-art methods on ActivityNet and FCVID. AdaMML outperforms LiteEval [59] in terms of accuracy (~1%-5%) with similar computation on both datasets.

                  ActivityNet            FCVID
Method            mAP (%)   GFLOPs       mAP (%)   GFLOPs
FrameGlimpse      60.14     33.33        67.55     30.10
FastForward       54.64     17.86        71.21     66.11
AdaFrame          71.5      78.69        80.2      75.13
LiteEval          72.7      95.1         80.0      94.3
AdaMML            73.91     94.01        85.82     93.86

Table 5: Comparison with LiteEval [59] on Kinetics-Sounds and Mini-Sports1M. AdaMML outperforms LiteEval by a significant margin in both accuracy and GFLOPs on both datasets.

                  Kinetics-Sounds        Mini-Sports1M
Method            Acc. (%)   GFLOPs      mAP (%)   GFLOPs
LiteEval          72.02      104.06      43.64     151.83
AdaMML            88.17      76.45       46.08     138.32

In addition, we also compare with ListenToLook [17], which uses both RGB and Audio to eliminate video redundancies. As ListenToLook utilizes weight distillation from a Kinetics400 pretrained model, we use Kinetics400 pretrained weights instead of ImageNet weights to initialize our unimodality models for a comparison on ActivityNet in Table 6. With the same network architecture (ResNet-18) and frame resolution (112x112), AdaMML outperforms ListenToLook by a margin of 2.87% in mAP while using 37.1% less computation. This once again shows that our proposed approach of adaptively selecting the right modalities on a per-segment basis is able to yield not only significant savings in computation but also improvements in accuracy. To show that the benefits of our method extend even to more recent and efficient networks, we use EfficientNet [47] in our approach and observe that it provides the best recognition performance of 85.62% mAP with only 30.55 GFLOPs (~73% less computation compared to AdaMML (ResNet-50 / MobileNetV2)).

Table 6: Comparison with ListenToLook [17] on ActivityNet. AdaMML outperforms ListenToLook by 3.44% in mAP while offering 26.9% computational savings in terms of GFLOPs.

Method            RGB Network       Audio Network     mAP (%)   GFLOPs
ListenToLook      ResNet-18         ResNet-18         76.61     112.65
AdaMML 112x112    ResNet-18         ResNet-18         79.48     70.87
AdaMML 224x224    ResNet-18         ResNet-18         80.05     82.33
AdaMML 224x224    ResNet-50         MobileNetV2       84.73     110.14
AdaMML 224x224    EfficientNet-b3   EfficientNet-b0   85.62     30.55

4.3. Ablation Studies

Comparison with Additional Fusion Strategies. We compare with four additional fusion strategies, in addition to weighted fusion, on different combinations of modalities. Table 7 shows that our approach AdaMML consistently outperforms all the hand-designed fusion strategies while offering 47.3%, 55.03% and 53.99% reductions in GFLOPs on the RGB + Audio, RGB + Flow and RGB + Flow + Audio combinations on Kinetics-Sounds, respectively. Furthermore, AdaMML with RGBDiff as the proxy alleviates the computational overhead of computing optical flow (which is often very expensive), making it suitable for online scenarios. Similarly, AdaMML offers 19.2% computational savings while outperforming these fusion strategies by a margin of about 2% in mAP on the RGB + Audio combination on ActivityNet.

Table 7: Comparison with fusion strategies on Kinetics-Sounds. AdaMML consistently outperforms hand-designed fusion strategies with overall 50%-60% computational savings.

                             RGB + Audio          RGB + Flow           RGB + Flow + Audio
Method                       Acc. (%)  GFLOPs     Acc. (%)  GFLOPs     Acc. (%)  GFLOPs
Average Fusion               88.15     145.17     83.30     304.75     88.18     308.56
Class-wise Weighted Fusion   87.86     145.17     83.82     304.75     87.75     308.56
Max Fusion                   86.49     145.17     83.47     304.75     88.06     308.56
FC2 Fusion*                  87.73     145.17     83.30     304.75     87.84     308.56
Weighted Fusion              87.86     145.17     83.47     304.75     88.25     308.56
AdaMML                       88.17     76.45      84.36     137.03     89.06     141.97
*: concatenating feature vectors from all modalities and adding two additional fully-connected layers to fuse features.

Policy Design. We investigate the effectiveness of our policy design by either selecting or skipping both modalities at the same time instead of taking the decisions per modality. In other words, we use a single FC layer in the policy network that outputs binary decisions, where 1 indicates the use of both modalities and 0 indicates skipping both modalities in our framework. AdaMML outperforms this alternative design (88.02 vs 88.17) while saving 18% GFLOPs on Kinetics-Sounds. Selecting both modalities at the same time increases the computation, as it favors selecting more of the RGB stream. On the other hand, AdaMML selects comparatively less of the RGB stream and focuses more on the cheaper audio stream, as many actions can be recognized from audio alone without looking at the RGB frames.

Comparison with Random Policy. We perform three different experiments by randomly selecting a modality with 50% probability during training and/or testing. Table 8 shows that our approach AdaMML outperforms all three variants by a large margin (e.g., 15.60% and 2.42% improvements over Random (Test) on ActivityNet and Kinetics-Sounds, respectively), which demonstrates the effectiveness of our learned policy in selecting the optimal modalities per input instance while recognizing videos. We also compare these variants using the same selection rate as ours, and AdaMML still outperforms them on ActivityNet and Kinetics-Sounds (e.g., 8.18% and 2.14% increases over Random (Test) and 2.23% and 1.55% increases over Random (Train + Test)).

Table 8: Comparison with random policy on RGB + Audio. Random (X) denotes random selection of modalities during the X phase of learning. AdaMML outperforms all the variants, showing the effectiveness of the learned policy in video recognition.

Method                  ActivityNet mAP (%)   Kinetics-Sounds Acc. (%)
Random (Train)          70.34                 84.34
Random (Test)           58.31                 85.75
Random (Train + Test)   70.85                 86.28
AdaMML                  73.91                 88.17

Ablation on Training Losses. As discussed in Section 3.3, \lambda_k and \gamma control the trade-off between accuracy and computational efficiency. We investigate the effect of the efficiency loss in the RGB + Audio experiment on Kinetics-Sounds and observe that training without the efficiency loss (both \lambda_k and \gamma set to 0) achieves a video accuracy of 88.82% (an improvement of 0.65%) while requiring 47.3% more computation than AdaMML trained with the efficiency loss. Similarly, using equal cost weights for both modalities (by setting \lambda_rgb = \lambda_audio = 1) achieves an accuracy of 86.82%, compared to 88.17% using AdaMML, with much lower utilization of audio (only 39.13%, in contrast to 94.15% using our approach). As processing the audio stream is much cheaper, we use \lambda_rgb = 1 and \lambda_audio = 0.05 to favor selection of cheaper modalities, and achieve an accuracy of 88.17% with 76.45 GFLOPs on Kinetics-Sounds. We further test the effect of the penalty factor \gamma in Equation 4 by varying it over [0, 2, 5, 10] and observe that it has little effect on the final performance, with the best performance at \gamma = 10 in all our experiments.

Effectiveness of LSTM. We investigate the effectiveness of the LSTM in modeling video causality on the RGB + Audio experiment and observe that directly predicting a choice via a single fully-connected layer (i.e., by removing the LSTM from the policy network) decreases the video accuracy from 88.17% to 86.82% on Kinetics-Sounds. This confirms that the LSTM is critical for good performance, as it makes the policy network aware of all the useful information seen so far.

Sampling Hyperparameters. We test the effect of the temperature (Equation 3) in the RGB + Audio experiment on the Kinetics-Sounds dataset by varying it over [0, 0.5, 5, 10] and observe that higher values (5, 10) show better performance (by 0.5%-0.7%) than lower ones. So, we start at a high temperature (set to 5 in all our experiments) and anneal it to a small non-zero value, as in [23]. Similarly, we also vary the annealing factor over [0, 0.5, 0.965] and notice that setting it to 0.965 leads to the best accuracy of 88.17%, while 0 leads to an accuracy of 87.40% on Kinetics-Sounds.

4.4. Qualitative Results

Figure 3: Qualitative examples showing the effectiveness of AdaMML in selecting the right modalities per video segment (marked by green borders). (a, b) RGB + Audio: AdaMML selects the RGB stream for the second and third segments in (a) while skipping the irrelevant audio coming from the reporter and the background song. Similarly in (b), it is able to select the RGB modality for only one segment while selecting the entire audio stream, as the action can be easily recognized with audio (Playing Piano). (c, d) RGB + Flow: Our approach selects the flow stream only when it is informative for the action, e.g., the second and third segments in (c) and only the second segment in (d). (e) RGB + Flow + Audio: AdaMML selects audio for most of the segments (but not for the last two segments, as the audio is not clear due to the mixing of sound from both instruments) while selecting flow only for the sixth segment, where the motion related to the action is clearly visible. Best viewed in color.

Figure 3 shows the modalities selected by our approach in different cases ((a, b) RGB + Audio, (c, d) RGB + Flow, and (e) RGB + Flow + Audio). As seen in Figure 3.(a), our approach is able to select the RGB modality for the segments that are more informative of the action and skip the audio stream, as the audio in that video is irrelevant to the action "fencing" (the majority of the audio comes from the reporter and the background song). Similarly in Figure 3.(b), it is able to select the RGB modality for only one segment while selecting the entire audio stream, as the action can be easily recognized with audio ("Playing Piano"). Overall, we observe that AdaMML focuses on the right modalities to use per segment for correctly classifying videos while taking efficiency into account (e.g., in Figure 3.(e), it mainly focuses on audio for most of the segments while selecting RGB only for two informative segments and the flow stream for the sixth segment for recognizing the action "Playing Accordion").

5. Conclusion

In this paper, we present AdaMML, a novel and differentiable approach for adaptively determining what modalities to use per segment per instance for efficient video recognition. In particular, we train a multi-modal policy network to predict these decisions with the goal of achieving both competitive accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. We demonstrate the effectiveness of our proposed approach on four standard datasets, outperforming several competing methods.
References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.
[2] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In ICCV, 2017.
[3] John Arevalo, Thamar Solorio, Manuel Montes-y Gómez, and Fabio A González. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992, 2017.
[4] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
[5] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[6] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[7] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[8] Chun-Fu Richard Chen, Rameswar Panda, Kandan Ramakrishnan, Rogerio Feris, John Cohn, Aude Oliva, and Quanfu Fan. Deep analysis of CNN-based spatio-temporal representations for action recognition. In CVPR, 2021.
[9] Zhourong Chen, Yang Li, Samy Bengio, and Si Si. You look twice: GaterNet for dynamic filter selection in CNNs. In CVPR, 2019.
[10] Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid. MARS: Motion-augmented RGB stream for action recognition. In CVPR, 2019.
[11] Hehe Fan, Zhongwen Xu, Linchao Zhu, Chenggang Yan, Jianjun Ge, and Yi Yang. Watching a small portion could be as good as watching all: Towards efficient video classification. In IJCAI, 2018.
[12] Quanfu Fan, Chun-Fu Richard Chen, Hilde Kuehne, Marco Pistoia, and David Cox. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. In NeurIPS, 2019.
[13] Christoph Feichtenhofer. X3D: Expanding architectures for efficient video recognition. In CVPR, 2020.
[14] Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In CVPR, 2017.
[15] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In NeurIPS, 2013.
[16] Mingfei Gao, Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. Dynamic zoom-in network for fast object detection in large images. In CVPR, 2018.
[17] Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, and Lorenzo Torresani. Listen to look: Action recognition by previewing audio. In CVPR, 2020.
[18] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
[19] Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris. SpotTune: Transfer learning through adaptive fine-tuning. In CVPR, 2019.
[20] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, 2018.
[21] Weizhe Hua, Yuan Zhou, Christopher M De Sa, Zhiru Zhang, and G Edward Suh. Channel gating neural networks. In NeurIPS, 2019.
[22] Noureldien Hussein, Mihir Jain, and Babak Ehteshami Bejnordi. TimeGate: Conditional gating of segments in long-range activities. arXiv preprint arXiv:2004.01808, 2020.
[23] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In ICLR, 2017.
[24] Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, and Shih-Fu Chang. Exploiting feature and class relationships in video categorization with regularized deep neural networks. TPAMI, 2017.
[25] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[26] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. EPIC-Fusion: Audio-visual temporal binding for egocentric action recognition. In ICCV, 2019.
[27] Douwe Kiela, Edouard Grave, Armand Joulin, and Tomas Mikolov. Efficient large-scale multi-modal classification. In AAAI, 2018.
[28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[29] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, 2018.
[30] Bruno Korbar, Du Tran, and Lorenzo Torresani. SCSampler: Sampling salient clips from video for efficient action recognition. In ICCV, 2019.
[31] Myunggi Lee, Seungeui Lee, Sungjoon Son, Gyutae Park, and Nojun Kwak. Motion feature network: Fixed motion filter for action recognition. In ECCV, 2018.
[32] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In ICCV, 2019.
[33] Mason McGill and Pietro Perona. Deciding how to decide: Dynamic routing in artificial neural networks. In ICML, 2017.
[34] Yue Meng, Chung-Ching Lin, Rameswar Panda, Prasanna Sattigeri, Leonid Karlinsky, Aude Oliva, Kate Saenko, and Rogerio Feris. AR-Net: Adaptive frame resolution for efficient action recognition. In ECCV, 2020.
[35] Yue Meng, Rameswar Panda, Chung-Ching Lin, Prasanna Sattigeri, Leonid Karlinsky, Kate Saenko, Aude Oliva, and Rogerio Feris. AdaFuse: Adaptive temporal fusion network for efficient action recognition. In ICLR, 2021.
[36] Mahyar Najibi, Bharat Singh, and Larry S Davis. AutoFocus: Efficient multi-scale inference. In ICCV, 2019.
[37] Bowen Pan, Rameswar Panda, Camilo Fosco, Chung-Ching Lin, Alex Andonian, Yue Meng, Kate Saenko, Aude Oliva, and Rogerio Feris. VA-RED2: Video adaptive redundancy reduction. In ICLR, 2021.
[38] Juan-Manuel Pérez-Rúa, Valentin Vielzeuf, Stéphane Pateux, Moez Baccouche, and Frédéric Jurie. MFAS: Multimodal fusion architecture search. In CVPR, 2019.
[39] AJ Piergiovanni, Anelia Angelova, and Michael S Ryoo. Tiny video networks. arXiv preprint arXiv:1910.06961, 2019.
[40] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
[41] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, pages 568-576, 2014.
[42] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In NeurIPS, 2013.
[43] Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang, and Wei Zhang. Optical flow guided feature: A fast and robust motion representation for video action recognition. In CVPR, 2018.
[44] Ximeng Sun, Rameswar Panda, and Rogerio Feris. AdaShare: Learning what to share for efficient deep multi-task learning. In NeurIPS, 2020.
[45] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[46] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[47] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[48] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[49] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In ICCV, 2019.
[50] Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. In ECCV, 2018.
[51] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[52] Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal networks hard? In CVPR, 2020.
[53] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[54] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. In ECCV, 2018.
[55] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[56] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, 2019.
[57] Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, and Shilei Wen. Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition. In ICCV, 2019.
[58] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. BlockDrop: Dynamic inference paths in residual networks. In CVPR, 2018.
[59] Zuxuan Wu, Caiming Xiong, Yu-Gang Jiang, and Larry S Davis. LiteEval: A coarse-to-fine framework for resource efficient video recognition. In NeurIPS, 2019.
[60] Zuxuan Wu, Caiming Xiong, Chih-Yao Ma, Richard Socher, and Larry S Davis. AdaFrame: Adaptive frame selection for fast video recognition. In CVPR, 2019.
[61] Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In CVPR, 2016.
[62] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018.
[63] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, 2018.
[64] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
