Event-Based Monocular Depth Estimation With Recurrent Transformers
Manuscript received 21 March 2023; revised 24 July 2023 and 1 November 2023; accepted 5 March 2024. Date of publication 18 March 2024; date of current version 12 August 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFF0900500; and in part by the National Natural Science Foundation of China (NSFC) under Grant U22B2035, Grant 62272128, Grant 62027804, and Grant 62088102. This article was recommended by Associate Editor Z. Li. (Corresponding author: Xiaopeng Fan.)

Xu Liu, Xiaopeng Fan, and Debin Zhao are with the Research Center of Intelligent Interface and Human Computer Interaction, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with the Peng Cheng Laboratory, Shenzhen 518000, China (e-mail: [email protected]; [email protected]; [email protected]).

Jianing Li is with the School of Computer Science, Peking University, Beijing 100871, China (e-mail: [email protected]).

Jinqiao Shi is with the School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100871, China (e-mail: [email protected]).

Yonghong Tian is with the School of Computer Science, Peking University, Beijing 100871, China, and also with the Peng Cheng Laboratory, Shenzhen 518000, China (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSVT.2024.3378742.

Digital Object Identifier 10.1109/TCSVT.2024.3378742

I. INTRODUCTION

MONOCULAR depth estimation [1], [2], [3], [4] is one of the critical and challenging topics, supporting widespread vision applications in a low-cost and effective manner. In fact, conventional frame-based cameras have presented some shortcomings for depth estimation in challenging conditions (e.g., motion blur and low light) [5], [6]. Recently, event cameras [7], [8], offering high temporal resolutions and high dynamic ranges, have been explored to address these common challenges [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. However, a key question remains: How to effectively exploit the global spatial information and rich temporal cues from asynchronous sparse events to generate dense depth maps?

For spatial modeling, the mainstream event-based monocular depth estimators [10], [11], [13], [22] adopt CNN-based architectures. For instance, Zhu et al. [11] design an unsupervised CNN-based encoder-decoder network for semi-dense depth estimation. Further, the following works [10], [13], [22] present supervised training frameworks to generate dense depth maps based on UNet [23]. Although these CNN-based learning methods achieve better performance than the model-based optimized approaches [14], [15], [16], [19], they are not capable of utilizing the global spatial information from asynchronous sparse events due to the essential locality of convolution operations. For temporal modeling, most existing event-based monocular depth estimators [10], [22] introduce RNN-based architectures. More specifically, lightweight recurrent convolutional architectures (e.g., ConvLSTM [24] and ConvGRU [25]) are incorporated into UNet [23] to model long-range temporal dependencies.
However, these RNN-based architectures essentially still use convolution operations to model the interaction between spatial and temporal information, showing a limited capacity for effective temporal modeling. More recently, transformers [26], [27], [28] have demonstrated appealing potential in modeling global spatial context information for frame-based monocular depth estimation tasks [21], [29], [30], [31]. Moreover, transformers can effectively establish the interaction between the spatial and temporal domains via the self-attention mechanism, and they have demonstrated impressive performance in temporal sequence tasks [32], [33], [34].

To this end, this paper proposes an event-based monocular depth estimator with recurrent transformers, namely EReFormer, which is the first transformer-based architecture with a recursive mechanism to process continuous event streams, as shown in Fig. 1. Our EReFormer is designed to model global spatial information and long-range temporal dependencies from event streams. More specifically, we first design a transformer-based encoder-decoder backbone using swin transformer blocks [27] for event-based monocular depth estimation, which utilizes multi-scale features to model global spatial information from events. Then, we propose a Gate Recurrent Vision Transformer (GRViT) to leverage rich temporal cues from event streams. The core of GRViT is to incorporate a recursive mechanism into Vision Transformer (e.g., ViT [28]) so that it can model long-range temporal dependencies. Finally, we present a Cross-Attention-guided Skip Connection (CASC) to improve global spatial modeling capabilities in our EReFormer, which fuses multi-scale features by performing cross-attention. The experimental results demonstrate that our EReFormer outperforms state-of-the-art methods by a large margin on both synthetic and real-world datasets (i.e., DENSE [10] and MVSEC [35]). Our EReFormer also verifies that event cameras can perform robust monocular depth estimation even in cases where conventional cameras fail, e.g., fast-motion and low-light scenarios.

In summary, the main contributions are as follows:
• We propose a novel transformer-based architecture (i.e., EReFormer) for event-based monocular depth estimation, which outperforms state-of-the-art methods in terms of depth map quality by a large margin.
• We design a gate recurrent vision transformer incorporating a recursive mechanism into transformers, which enhances temporal modeling capabilities for event streams while mitigating the costly GPU memory requirement.
• We present a cross-attention-guided skip connection, which improves global spatial modeling capabilities via performing cross-attention to fuse multi-scale features.
To the best of our knowledge, this is the first work to explore such a recurrent transformer to generate dense depth maps using a monocular event camera, which further unveils the versatility and transferability of transformers from conventional frames to continuous event streams.

The rest of this paper is organized as follows. Section II reviews prior work. Section III formulates the novel problem. In Section IV, we explain the details of the proposed framework. The experimental results, ablations and analysis are provided in Section V, while some discussions are reported in Section VI. Finally, we conclude the paper in Section VII.

II. RELATED WORK

A. Event-Based Monocular Depth Estimation

Event cameras for monocular depth estimation have become increasingly popular in robot navigation [7], [36], [37], especially involving low-latency obstacle avoidance and high-speed path planning. Early model-based works [14], [15], [16], [17], [19] usually calculate both camera poses and depth maps via solving a non-linear optimization problem. Yet, these model-based optimized methods need to obtain camera poses or auxiliary sensor parameters (e.g., IMU). Recently, various learning-based methods [10], [11], [13], [22] have been introduced to convert asynchronous events into depth maps. Although these CNN-based methods achieve promising results, they insufficiently exploit global spatial information, and some of these feed-forward models [11], [13] have not yet used rich temporal cues from event streams. In addition, the lack of effective spatio-temporal information interactions in RNN-based backbones [10], [22] may limit performance improvements.

As illustrated in Table I, we make a comprehensive literature review on event-based monocular depth estimation. The existing event-based monocular depth estimators can be broadly classified into two categories (i.e., model-based optimized methods [14], [15], [17], [19] and learning-based methods [11], [13], [22]). Besides, the predicted density of depth models contains three types (i.e., sparse [14], [15], semi-dense [11], [19], and dense [10], [13], [17], [22]). The sparse map refers to the depth only at pixels where events occurred, the semi-dense map denotes the depth at the reconstructed edges of the image, and the dense map is the depth prediction at all pixels.

B. Transformer-Based Monocular Depth Estimation

Transformers are applied in frame-based monocular depth estimation tasks [21], [29], [38], [39], [40], [41], [42] by integrating the self-attention mechanism or the full transformer as a powerful module. For instance, DPT [21] first leverages vision transformers instead of CNN-based backbones for dense depth prediction tasks. Meanwhile, Swin-Depth [39] proposes a transformer-based monocular depth estimation method that uses hierarchical representation learning with linear complexity for images. Subsequently, Depthformer [42] presents a hybrid CNN-Transformer architecture consisting of a transformer branch to learn the long-range correlation and a convolution branch to extract the local information. In addition, some studies [29], [38], [40], [41] adopt transformers for self-supervised monocular depth estimation. Although the above works have achieved finer-grained and more globally coherent predictions than CNN-based methods, these transformer-based architectures operate on each isolated image, so they do not directly process a continuous stream of asynchronous events.
TABLE I
A LITERATURE REVIEW ON MONOCULAR EVENT-BASED DEPTH ESTIMATION
More recently, some event-based vision tasks (e.g., event representation [43], video reconstruction [44], event-based denoising [45], object tracking [46], and object recognition [47]) have sought to design transformer-based frameworks for better performance. For example, ET-Net [44] introduces transformers into CNN for event-based video reconstruction, which effectively models global context via the transformer-based module. Alkendi et al. [45] develop a hybrid GNN-Transformer model for event camera denoising. CTN [47] presents a hybrid CNN-Transformer network for event-based data classification. However, there are only a few explorations of event-based monocular depth estimation tasks. In this paper, we propose EReFormer, which is a pure transformer-based architecture to model global spatial context information and long-range temporal dependencies for event-based monocular depth estimation.

III. PROBLEM DEFINITION

Event cameras, such as DVS [48] and DAVIS [49], are bio-inspired vision sensors that respond to light changes with continuous event streams. Each event e_n can be described as a four-attribute tuple (x_n, y_n, t_n, p_n). Consequently, asynchronous events S = {e_n}_{n=1}^{N_e} are sparse and discrete points in the spatio-temporal window. In general, a continuous event stream needs to be split into event temporal bins. Obviously, the temporal correlation lies in adjacent event temporal bins [50]. However, most existing event-based monocular depth estimators [11], [13], running a feed-forward frame-based model independently on each event image [51] or voxel grid [11], have not yet leveraged rich temporal cues. In this work, we focus on this knowledge gap and formulate the challenging issue called event-based monocular depth estimation as follows.

Let {S_1, . . . , S_T} be event temporal bins separated from a continuous event stream S, where S_t ∈ R^{W×H×Δt} is the t-th event temporal bin with the duration Δt = 50 ms. To make asynchronous events compatible with deep learning techniques [52], event temporal bins need to be converted into event embeddings E = {E_1, . . . , E_T} by a kernel function K, where E_t ∈ R^{W×H×C_e} is the t-th event embedding with the channel number C_e. The goal of our monocular depth estimator is to learn a non-linear mapping function M to generate dense depth maps D = {D_1, . . . , D_T} by exploiting the spatio-temporal information, which can be formulated as:

D = M(K(S_1), . . . , K(S_T)),   (1)

where the proposed function M can leverage rich temporal cues from event temporal bins, and the parameter T determines the length of utilized temporal information.

Given the ground-truth depth maps D̄ = {D̄_1, . . . , D̄_T}, we minimize the loss function between the predicted depth map D_t and the ground truth D̄_t as follows:

M̂ = arg min_M L_M(D, D̄) ≜ E_{t∈[1,T]}[d(D_t, D̄_t)],   (2)

where E[·] is an empirical expectation function and d(·, ·) is a distance metric, e.g., the scale-invariant loss.
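To make the pre-processing in Eq. (1) concrete, the following minimal sketch (ours, not the authors' code) splits a raw event stream into Δt = 50 ms temporal bins and applies one simple instance of the kernel K, namely a per-polarity event-count image in the spirit of the event image representation [51]; the function names and the (N, 4) array layout are illustrative assumptions.

import numpy as np

def split_into_bins(events, delta_t_us=50_000):
    """Split an event stream, an (N, 4) array of (x, y, t, p) rows, into
    consecutive temporal bins S_1, ..., S_T of duration delta_t_us."""
    t0, t1 = events[:, 2].min(), events[:, 2].max()
    edges = np.arange(t0, t1 + delta_t_us, delta_t_us)
    return [events[(events[:, 2] >= lo) & (events[:, 2] < hi)]
            for lo, hi in zip(edges[:-1], edges[1:])]

def event_image(bin_events, height, width):
    """A simple kernel K: per-pixel event counts for each polarity,
    yielding a 2D image-like tensor E_t of shape (2, H, W)."""
    img = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _, p in bin_events:
        img[int(p > 0), int(y), int(x)] += 1.0
    return img

Any other representation (e.g., a voxel grid) can play the role of K, as noted in the framework overview below.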
IV. METHODOLOGY

This section first gives an overview of our framework. Then, we present the details of the three important components of the proposed framework: the transformer-based encoder-decoder, the gate recurrent vision transformer, and the cross-attention-guided skip connection. Finally, we give the training details of our method.

A. Framework Overview

This work aims at designing an event-based monocular depth estimator with recurrent transformers, termed EReFormer, which can generate high-quality dense depth maps via modeling global spatial context information and leveraging rich temporal cues. As shown in Fig. 2(a), our EReFormer mainly consists of three modules: the transformer-based encoder-decoder, the gate recurrent vision transformer (GRViT) module, and the cross attention-guided skip connection (CASC) module. More precisely, the event stream S is first split into event temporal bins {S_1, . . . , S_T}, and each bin S_t is converted into a 2D image-like representation E_t. To provide a more compelling demonstration of our proposed framework, we utilize the event image representation [51] to encode each bin. This choice is motivated by its ease of implementation and faster inference speed. In fact, our EReFormer framework offers a generic interface, allowing alternative event representations to be used as well, providing flexibility and adaptability to different scenarios. Then, the transformer-based encoder, utilizing swin transformer blocks [27], progressively extracts multi-scale features via the downsampling operation. Meanwhile, the GRViT incorporates a recursive mechanism into Vision Transformer (e.g., ViT [28]) to model long-range temporal dependencies, which can leverage rich temporal cues from event streams and alleviate the expensive GPU memory cost. To further improve global spatial modeling capabilities, the CASC is designed as a skip connection to fuse multi-scale features. Finally, the corresponding decoder predicts fine-grained and globally coherent depth maps {D_1, . . . , D_T} using the hierarchical upsampling transformer blocks.
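The overall workflow can be summarized by the short inference loop below. It is a sketch under assumed module interfaces (encoder, grvit, and casc_decoder are hypothetical names, not the released code); it only illustrates how the hidden state carries temporal cues across event temporal bins, as described above.

import torch

@torch.no_grad()
def predict_sequence(event_images, encoder, grvit, casc_decoder):
    """Recurrent inference over a sequence of event embeddings E_1..E_T.

    event_images: list of tensors (B, Ce, H, W); the hidden state h carries
    temporal cues from one event temporal bin to the next.
    """
    h = None                       # hidden state of the GRViT, h_0
    depth_maps = []
    for e_t in event_images:
        feats = encoder(e_t)       # multi-scale features; feats[-1] is the bottleneck f_t
        f_hat, h = grvit(feats[-1], h)          # Eq. (3): (f_hat_t, h_t) = G(f_t, h_{t-1})
        d_t = casc_decoder(f_hat, feats[:-1])   # decoder with cross-attention skip connections
        depth_maps.append(torch.sigmoid(d_t))   # normalized log-depth prediction in [0, 1]
    return depth_maps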
B. Transformer-Based Encoder-Decoder

Due to the sparse and discrete attributes of asynchronous events, it is difficult to extract effective global spatial information from the local space using CNN-based models. To overcome this challenge, we develop a transformer-based encoder-decoder that models global spatial information from event streams for monocular depth estimation.

1) Transformer Encoder: In order to enhance the global information learning ability under different scale features, we exploit the widely-used Swin-T [27] as our backbone, which utilizes the hierarchical attention mechanism to extract features. Specifically, a 2D image-like representation E_t ∈ R^{W×H×C_e} is first split into non-overlapping patches of size 4 × 4 and then projected to tokens with dimension C by a patch embedding layer. Furthermore, all tokens are input to four transformer layers with different block numbers (i.e., 2, 2, 6, and 2), and each transformer layer performs a downsampling operation that reduces the spatial resolution and increases the channel number by a factor of 2.

2) Transformer Decoder: As a symmetrical architecture, the corresponding decoder is also a hierarchical network with four transformer layers. In detail, each layer first increases the channel number and then decreases the spatial resolution via the patch-splitting operation. After that, the last transformer layer further refines the feature map d_t^0, and a task-specific head followed by the sigmoid function is implemented to predict a dense depth map D_t as the final output.
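For concreteness, the resolutions and channel widths of the multi-scale features can be computed as below; this is simple arithmetic under one common Swin-T-style layout (4 × 4 patch embedding, per-stage factor-2 downsampling, and the base dimension C = 96 reported in the implementation details), not the authors' code.

def encoder_feature_shapes(height, width, base_dim=96, num_stages=4):
    """Shapes of the multi-scale features for a Swin-T-like encoder:
    stage i outputs (H / (4 * 2^i), W / (4 * 2^i)) tokens with 2^i * C channels."""
    shapes = []
    h, w, c = height // 4, width // 4, base_dim   # after the 4x4 patch embedding
    for _ in range(num_stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2           # downsample, double the channels
    return shapes

# e.g., a 256x256 event image -> [(64, 64, 96), (32, 32, 192), (16, 16, 384), (8, 8, 768)]
print(encoder_feature_shapes(256, 256))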
C. Gate Recurrent Vision Transformer

Temporal transformers have achieved great success in various video sequence tasks [32], [33], efficiently modeling temporal dependencies in a parallel manner. Nevertheless, one problem is that these parallel-processing temporal transformers require a large GPU memory. Another problem is that the temporal information extracted from asynchronous events in batch mode is limited. To overcome these problems, we design a gate recurrent vision transformer (GRViT) that introduces a recursive mechanism into Vision Transformer (e.g., ViT [28]), which can improve temporal modeling capabilities for event streams while alleviating the expensive GPU memory cost.

The overview diagram of the proposed GRViT is shown in Fig. 2(b). For the current event temporal bin S_t, our GRViT G takes the feature map f_t and the hidden state h_{t−1} from the previous temporal bin as the input, then outputs the current hidden state h_t and the spatio-temporal feature map f̂_t, which can be formulated as:

(f̂_t, h_t) = G(f_t, h_{t−1}).   (3)

To be specific, our GRViT mainly consists of two core parts, namely the attention gate and the update gate. A learnable positional encoding vector needs to be appended to f_t before inputting it into the GRViT. The attention gate is utilized to generate the attention feature map A_t. Firstly, A_t is added to the input f_t, followed by a feed-forward network (FFN) with a residual connection, which outputs the spatio-temporal feature map f̂_t. Secondly, A_t and h_{t−1} are passed through the update gate, which outputs the current hidden state h_t.

The attention gate aims at establishing the interaction between the spatial and temporal domains from the current feature map and the previous hidden state. Firstly, the input of the attention gate is a triplet (i.e., Q_t, K_t, and V_t), which can be computed from f_t and h_{t−1} as:

Q_t = f_t W_Q^f + h_{t−1} W_Q^h,
K_t = f_t W_K^f + h_{t−1} W_K^h,
V_t = f_t W_V^f + h_{t−1} W_V^h,   (4)

where W_Q^f, W_K^f, W_V^f, W_Q^h, W_K^h, and W_V^h are learnable parameters of linear projection layers. Then, a linear attention operation replaces the SoftMax to prevent gradient vanishing, and it can be depicted as:

a_t = (elu(Q_t) + 1)((elu(K_t))^⊤ + 1) V_t,   (5)

where elu is the ELU activation function. Finally, the attention feature map A_t can be obtained by running m independent linear-attention operations and projecting their concatenated outputs as:

A_t = [a_t^1; . . . ; a_t^m] W_a,   (6)

where W_a denotes a linear layer that is used to project the attended vector.

As a result, the final output spatio-temporal feature map f̂_t can be formulated as:

f̂_t = A_t + f_t + FFN(A_t + f_t).   (7)

The update gate determines how much temporal information will be passed to the next time step. f_t and h_{t−1} are concatenated and passed to a linear projection layer followed by a sigmoid function to output the gate U_t, which can be expressed as:

U_t = σ([f_t; h_{t−1}] W_p),   (8)

where W_p refers to the linear projection layer and σ(·) indicates the sigmoid activation function.

In fact, U_t determines how much attended information to keep and how much temporal information in the previous hidden state to discard. Thus, the current hidden state h_t can be computed as follows:

h_t = (1 − U_t) ⊙ h_{t−1} + U_t ⊙ A_t.   (9)
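A minimal single-head PyTorch sketch of the GRViT recurrence in Eqs. (3)-(9) is given below. It follows the equations literally but omits the learnable positional encoding and the m-head extension of Eq. (6); the class and variable names are ours, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GRViTCell(nn.Module):
    """Gate recurrent vision transformer cell: (f_hat_t, h_t) = G(f_t, h_{t-1})."""

    def __init__(self, dim):
        super().__init__()
        # Eq. (4): separate projections of the current feature f_t and hidden state h_{t-1}
        self.q_f, self.q_h = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.k_f, self.k_h = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.v_f, self.v_h = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.proj_a = nn.Linear(dim, dim)      # W_a in Eq. (6), single head here
        self.update = nn.Linear(2 * dim, dim)  # W_p in Eq. (8)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f_t, h_prev=None):
        # f_t: (B, N, D) tokens of the current event temporal bin
        if h_prev is None:
            h_prev = torch.zeros_like(f_t)
        q = self.q_f(f_t) + self.q_h(h_prev)               # Eq. (4)
        k = self.k_f(f_t) + self.k_h(h_prev)
        v = self.v_f(f_t) + self.v_h(h_prev)
        q, k = F.elu(q) + 1.0, F.elu(k) + 1.0              # Eq. (5): elu(.)+1 feature map
        a_t = q @ (k.transpose(-2, -1) @ v)                # linear attention, no SoftMax
        a_t = self.proj_a(a_t)                             # Eq. (6) with m = 1
        f_hat = a_t + f_t + self.ffn(a_t + f_t)            # Eq. (7)
        u_t = torch.sigmoid(self.update(torch.cat([f_t, h_prev], dim=-1)))  # Eq. (8)
        h_t = (1.0 - u_t) * h_prev + u_t * a_t             # Eq. (9)
        return f_hat, h_t

In the full module, Eq. (6) runs m such linear-attention operations in parallel and projects their concatenation with W_a.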
D. Cross Attention-Guided Skip Connection

Most event-based monocular depth estimators [10], [11], [13] adopt an aggregation operation (e.g., ADD or CONCAT) as a skip connection to fuse multi-scale features.
Fig. 2. The structure of the proposed event-based monocular depth estimator with recurrent transformers. (a) The overall workflow of our EReFormer. The
event stream is first converted into event embedding [52] and then split into non-overlapping patches. Then, the patches are processed via an encoder-decoder
sub-network with transformer blocks. (b) The proposed GRViT incorporates a recursive mechanism into transformers to leverage temporal cues. (c) The
designed CASC module is presented as a skip connection to fuse multi-scale features.
However, these fusion strategies insufficiently exploit the global spatial context information from sparse asynchronous events. Thus, we propose a cross attention-guided skip connection (CASC) to overcome this problem via cross-attention learning.

Our CASC module mainly consists of two core transformer blocks, namely regular window-based and shifted window-based multi-head self-attention (i.e., WMSA and SWMSA [27]). As illustrated in Fig. 2(c), WMSA and SWMSA each perform the cross-attention operation with a residual connection. Taking WMSA as an example, we use the decoded feature map d_t to generate the query (Q_t), and utilize the output f̂_t of GRViT to generate the key (K_t) and value (V_t). Taking the triplet (i.e., Q_t, K_t, and V_t) as the input, our CASC module progressively models spatial contextual information and outputs the cross-attention feature map d̄_t. Finally, the fused feature map d̂_t is obtained by a residual connection that integrates d_t and d̄_t. Thus, our CASC module can be formulated as follows:

d̃_t = WMSA(d_t, f̂_t) + FFN(WMSA(d_t, f̂_t)),   (10)

The normalized log-depth prediction is converted into a metric depth map D_{m,t}, where D_max is the maximum observed depth and ϵ is used to map the minimum observed depth to 0. In our experiments, D_max = 80 and ϵ = 3.7.

For training losses, we use the scale-invariant loss [53], which is defined as:

L_{t,si} = (1/n) Σ_i (R_t(i))^2 − (λ/n^2) (Σ_i R_t(i))^2,   (12)

where R_t = D̄_t − D_t, λ = 0.85, and n is the number of valid ground-truth pixels i. Following the practice of E2Depth [20], we also use a multi-scale scale-invariant gradient matching loss L_{t,grad} that encourages smooth depth changes and enforces sharp depth discontinuities in the depth map prediction. Finally, the resulting total loss for a sequence of length L is:

L_tot = Σ_{t=0}^{L−1} (α L_{t,si} + β L_{t,grad}).   (13)
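A compact sketch of the training objective in Eqs. (12)-(13) is shown below. The scale-invariant term follows Eq. (12) directly; the multi-scale gradient matching term and the weights alpha and beta are written in a generic E2Depth/MegaDepth style and should be read as assumptions rather than the exact form used here.

import torch

def scale_invariant_loss(pred, target, valid, lam=0.85):
    """Eq. (12): R_t is ground truth minus prediction over valid pixels."""
    r = (target - pred)[valid]
    n = r.numel()
    return (r ** 2).sum() / n - lam * (r.sum() ** 2) / (n ** 2)

def grad_matching_loss(pred, target, valid, num_scales=4):
    """Multi-scale scale-invariant gradient matching term (assumed form):
    L1 differences of log-depth residual gradients at several scales."""
    loss = 0.0
    r = torch.where(valid, target - pred, torch.zeros_like(pred))
    for s in range(num_scales):
        step = 2 ** s
        dx = (r[..., :, step:] - r[..., :, :-step]).abs()
        dy = (r[..., step:, :] - r[..., :-step, :]).abs()
        loss = loss + dx.mean() + dy.mean()
    return loss

def sequence_loss(preds, targets, valids, alpha=1.0, beta=0.5):
    """Eq. (13): total loss over a sequence of length L (alpha, beta assumed)."""
    return sum(alpha * scale_invariant_loss(p, t, v) + beta * grad_matching_loss(p, t, v)
               for p, t, v in zip(preds, targets, valids))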
TABLE II
QUANTITATIVE RESULTS ON THE MVSEC DATASET
Fig. 3. Representative examples of four test sequences in the MVSEC dataset. The first row to the fourth row corresponds to the outdoor day1, outdoor
night1, outdoor night2, and outdoor night3, respectively. The second column refers to the MegaDepth [20] prediction using the APS frames. Note that,
MegaDepth fails to predict the fine-grained depth map at low-light conditions. Compared with E2Depth+ [10], our EReFormer can achieve more globally
coherent predictions both day and night, which is closer to the ground truth.
Town10 for testing. For the MVSEC dataset, we use outdoor day2 for training and four sequences (i.e., outdoor day1 and outdoor night1 to outdoor night3) for testing.

2) Evaluation Metrics: To compare different methods, the absolute relative error (Abs.Rel.), logarithmic root mean squared error (RMSELog), scale-invariant logarithmic error (SILog), accuracy (δ < 1.25^n, n = 1, 2, 3), average absolute depth errors at different cut-off depth distances (i.e., 10 m, 20 m, and 30 m), and running time (ms) are selected as six typical evaluation metrics, which are the most broadly utilized in the depth estimation task.
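For reference, the metrics listed above can be computed as in the following generic sketch over valid ground-truth pixels; the thresholds and averaging conventions are standard ones and may differ in detail from the evaluation code actually used.

import torch

def depth_metrics(pred, gt, valid):
    """Abs.Rel., RMSELog, SILog and threshold accuracies over valid pixels (depths > 0)."""
    p, g = pred[valid], gt[valid]
    abs_rel = ((p - g).abs() / g).mean()
    d = torch.log(p) - torch.log(g)
    rmse_log = torch.sqrt((d ** 2).mean())
    silog = torch.sqrt((d ** 2).mean() - d.mean() ** 2)
    ratio = torch.max(p / g, g / p)
    acc = {f"delta<1.25^{n}": (ratio < 1.25 ** n).float().mean() for n in (1, 2, 3)}
    return {"AbsRel": abs_rel, "RMSELog": rmse_log, "SILog": silog, **acc}

def cutoff_abs_error(pred, gt, valid, max_depth=10.0):
    """Average absolute depth error restricted to ground truth below a cut-off (10/20/30 m)."""
    mask = valid & (gt < max_depth)
    return (pred[mask] - gt[mask]).abs().mean()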
3) Implementation Details: Our EReFormer is implemented using the PyTorch framework [54]. We use Swin-T [27] pre-trained on ImageNet as the backbone to achieve an accuracy-speed trade-off. We set the channel number C to 96. During training, we use the AdamW optimizer [55] with weight decay 0.1 and adopt the 1-cycle policy [56] for the learning rate with max_lr = 3.2 × 10^−5. We train our network for 200 epochs with batch size 2. Further, we use truncated backpropagation through time (TBPTT) in training to prevent gradient vanishing or exploding, and unroll the sequence by 16 steps due to memory limitations. All experiments are conducted on NVIDIA Tesla V100-PCIE GPUs.
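The training recipe above can be sketched as a truncated-backpropagation-through-time loop: gradients flow within each 16-step subsequence, and the hidden state is detached before the next one. The optimizer and schedule follow the reported settings, while the model, criterion, and dataloader interfaces are placeholders.

import torch

def train_epoch(model, loader, optimizer, scheduler, criterion, unroll=16, device="cuda"):
    """One epoch of TBPTT training: backpropagate within each 16-step subsequence,
    then detach the hidden state so gradients do not flow across subsequences."""
    model.train()
    for sequence in loader:                          # sequence: list of (event_image, gt_depth)
        hidden = None
        for start in range(0, len(sequence), unroll):
            loss = 0.0
            for event_image, gt_depth in sequence[start:start + unroll]:
                pred, hidden = model(event_image.to(device), hidden)
                loss = loss + criterion(pred, gt_depth.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            hidden = hidden.detach()                 # truncate the gradient path through time

# Optimizer and schedule as reported in the text:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3.2e-5, weight_decay=0.1)
# scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=3.2e-5,
#                                                 total_steps=num_training_steps)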
TABLE III
QUANTITATIVE RESULTS ON THE DENSE DATASET
Fig. 4. Representative examples of the testing sequence in the DENSE dataset. Obviously, our EReFormer obtains finer-grained and more globally coherent
dense depth maps than the best event-based competitor that utilizes E2Depth+ [10] to process the event stream.
4) Comparisons: To verify the effectiveness of the proposed approach, we compare our EReFormer with four state-of-the-art methods (i.e., E2Depth [10] for the voxel grid, DTL− [13] for the event image, E2Depth+ [10] for the voxel grid, and DPT [21] for the event image). It should be noted that E2Depth+ is pretrained on the first 1000 samples of the DENSE dataset and then retrained on both datasets, sharing the same architecture as E2Depth. DTL− selects one branch of the standard DTL [13] to convert each event image into a depth map. DPT is an outstanding frame-based monocular depth estimator that adopts vision transformers to process each event image. To be fair, we evaluate DTL− and the DPT architecture in the same experimental settings as our approach.

B. Main Experiments

1) Evaluation on the MVSEC Dataset: As illustrated in Table II, we quantitatively compare our EReFormer with four state-of-the-art methods on the MVSEC dataset [35]. All networks predict depth in the logarithmic scale, which is normalized and restored to absolute values by multiplying by the maximum depth clipped at 80 m. Note that our EReFormer achieves the best performance across the whole test sets, especially on the most valuable metric (i.e., Abs.Rel.). At the same time, we can see that DPT [21], using vision transformers, obtains better performance than the best CNN-based method E2Depth+ [10], which proves that utilizing the global spatial information from sparse events helps predict more accurate depth maps in different scenarios. Although DPT has achieved satisfactory results for event-based monocular dense depth estimation, it is sub-optimal due to not leveraging rich temporal cues from continuous event streams. In terms of the average absolute depth error at the 10 m, 20 m, and 30 m cut-offs, our EReFormer achieves more accurate depth predictions at all distances, with an average improvement over all test sequences of 14.8% at 10 m, 15.1% at 20 m, and 9.4% at 30 m with respect to DPT. In addition, our EReFormer is almost comparable to DPT in computational speed. Overall, it can be concluded that efficient global sparse spatial modeling and temporal utilization can improve the performance of event-based monocular depth estimation. We further present some visualization results on the MVSEC dataset in Fig. 3. Our EReFormer shows apparent advantages in HDR scenes, where the APS frames (the second column) fail to yield correct depth information in low-light conditions. Compared with E2Depth+, even though it was trained on both datasets, our EReFormer
TABLE IV
PERFORMANCE COMPONENTS OF OUR EREFORMER
TABLE V
COMPARISON WITH TYPICAL SKIP CONNECTION STRATEGIES
TABLE VI
ABLATING HIDDEN STATE TRANSFER IN THE GRVIT
TABLE VII
COMPARISON OF USING VARIOUS EVENT REPRESENTATIONS IN EREFORMER
TABLE VIII
COMPARISON WITH VARIOUS ENCODER BACKBONES
TABLE IX
GPU MEMORY COST ANALYSIS
TABLE XI
THE PARAMETERS AND GFLOPS OF DIFFERENT METHODS
Fig. 6. Representative visualization results on continuous sequences of the MVSEC dataset. Compared with the feed-forward baseline without GRViT (i.e.,
w/o GRViT), our EReFormer (i.e., w GRViT) performs better and obtains temporal consistent estimation results.
Fig. 7. Representative examples of three motion blur scenarios. The second column refers to the MegaDepth [20] prediction using the blurred APS frames.
Note that, MegaDepth fails to predict the fine-grained depth map at motion blur conditions. Compared with E2Depth+ [10], our EReFormer can achieve more
globally coherent predictions even in motion blur scenarios, closer to the ground truth.
from the previous subsequence serving as the initial hidden state for the next subsequence. We also conduct experiments to investigate the impact of using subsequences of different lengths for training EReFormer. We adjust the subsequence length to 4, 8, 12, and 16 to evaluate performance. As shown in Table XII, longer subsequences yield better performance. Consequently, due to memory limitations, we set the subsequence length to 16 in EReFormer training.

D. Scalability Experiments

This subsection first presents the visualization of the temporal modeling operation, and then provides some representative examples in motion blur scenarios. Finally, we analyze some failure cases of our EReFormer.

1) Visualization of Temporal Modeling: As shown in Fig. 6, we present some comparative visualization results with and without utilizing rich temporal cues. The feed-forward baseline, using a single event temporal bin, suffers from some failure cases involving buildings, as shown in the first row of Fig. 6. Fortunately, our EReFormer overcomes these problems by leveraging rich temporal cues from the continuous event stream and produces temporally consistent estimation results.

2) Representative Examples in Motion Blur Scenarios: We further present some visualization results in Fig. 7. RGB frames fail to support fine-grained depth prediction in high-speed motion blur scenarios. Much to our surprise, our EReFormer, inheriting the high temporal resolution and HDR properties of DVS events, performs robust depth estimation in these challenging scenarios. In other words, event cameras can perform robust monocular depth estimation even in cases where conventional cameras fail, e.g., fast-motion and low-light scenarios.

3) Failure Case Analysis: Although our EReFormer achieves satisfactory results even in challenging scenes, some failure cases still remain. As depicted in Fig. 8, the first and third columns show that it is hard to perform high-quality depth prediction in extremely slow-moving scenes. This is because event cameras evidently sense dynamic changes, but they
[15] H. Rebecq, G. Gallego, E. Mueggler, and D. Scaramuzza, "EMVS: Event-based multi-view stereo—3D reconstruction with an event camera in real-time," Int. J. Comput. Vis., vol. 126, no. 12, pp. 1394–1414, Dec. 2018.
[16] G. Gallego, M. Gehrig, and D. Scaramuzza, "Focus is all you need: Loss functions for event-based vision," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12272–12281.
[17] M. Cui, Y. Zhu, Y. Liu, Y. Liu, G. Chen, and K. Huang, "Dense depth-map estimation based on fusion of event camera and sparse LiDAR," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–11, 2022.
[18] H. Cho, J. Jeong, and K.-J. Yoon, "EOMVS: Event-based omnidirectional multi-view stereo," IEEE Robot. Autom. Lett., vol. 6, no. 4, pp. 6709–6716, Oct. 2021.
[19] H. Kim, S. Leutenegger, and A. J. Davison, "Real-time 3D reconstruction and 6-DoF tracking with an event camera," in Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VI. Springer, 2016, pp. 349–364.
[20] Z. Li and N. Snavely, "MegaDepth: Learning single-view depth prediction from internet photos," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2041–2050.
[21] R. Ranftl, A. Bochkovskiy, and V. Koltun, "Vision transformers for dense prediction," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 12179–12188.
[22] D. Gehrig, M. Rüegg, M. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza, "Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction," IEEE Robot. Autom. Lett., vol. 6, no. 2, pp. 2822–2829, Apr. 2021.
[23] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III. Springer, 2015, pp. 234–241.
[24] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Proc. Adv. Neural Inf. Process. Syst., vol. 28, 2015, pp. 1–9.
[25] N. Ballas, L. Yao, C. Pal, and A. Courville, "Delving deeper into convolutional networks for learning video representations," 2015, arXiv:1511.06432.
[26] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[27] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012–10022.
[28] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[29] V. Guizilini, R. Ambruş, D. Chen, S. Zakharov, and A. Gaidon, "Multi-frame self-supervised depth with transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 160–170.
[30] M. S. Junayed, A. Sadeghzadeh, M. B. Islam, L.-K. Wong, and T. Aydin, "HiMODE: A hybrid monocular omnidirectional depth estimation model," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2022, pp. 5208–5217.
[31] Y. Wang, J. Li, L. Zhu, X. Xiang, T. Huang, and Y. Tian, "Learning stereo depth estimation with bio-inspired spike cameras," in Proc. IEEE Conf. Multimedia Express, Jul. 2022, pp. 1–6.
[32] M. Cao, Y. Fan, Y. Zhang, J. Wang, and Y. Yang, "VDTR: Video deblurring with transformer," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 160–171, Jan. 2023.
[33] J. Li et al., "Video semantic segmentation via sparse temporal transformer," in Proc. ACM Int. Conf. Multimedia, 2021, pp. 59–68.
[34] J. Yang, X. Dong, L. Liu, C. Zhang, J. Shen, and D. Yu, "Recurring the transformer for video action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2022, pp. 14063–14073.
[35] A. Z. Zhu, D. Thakur, T. Ozaslan, B. Pfrommer, V. Kumar, and K. Daniilidis, "The multivehicle stereo event camera dataset: An event camera dataset for 3D perception," IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 2032–2039, Jul. 2018.
[36] D. Falanga, K. Kleber, and D. Scaramuzza, "Dynamic obstacle avoidance for quadrotors with event cameras," Sci. Robot., vol. 5, no. 40, Mar. 2020, Art. no. eaaz9712.
[37] A. Mitrokhin, P. Sutor, C. Fermüller, and Y. Aloimonos, "Learning sensorimotor control with neuromorphic sensors: Toward hyperdimensional active perception," Sci. Robot., vol. 4, no. 30, May 2019, Art. no. eaaw6736.
[38] C. Zhao et al., "MonoViT: Self-supervised monocular depth estimation with a vision transformer," in Proc. Int. Conf. 3D Vis. (3DV), Sep. 2022, pp. 668–678.
[39] Z. Cheng, Y. Zhang, and C. Tang, "Swin-depth: Using transformers and multi-scale fusion for monocular-based depth estimation," IEEE Sensors J., vol. 21, no. 23, pp. 26912–26920, Dec. 2021.
[40] S. Hwang, S. Park, J. Baek, and B. Kim, "Self-supervised monocular depth estimation using hybrid transformer encoder," IEEE Sensors J., vol. 22, no. 19, pp. 18762–18770, Oct. 2022.
[41] D. Han, J. Shin, N. Kim, S. Hwang, and Y. Choi, "TransDSSL: Transformer based depth estimation via self-supervised learning," IEEE Robot. Autom. Lett., vol. 7, no. 4, pp. 10969–10976, Oct. 2022.
[42] Z. Li, Z. Chen, X. Liu, and J. Jiang, "DepthFormer: Exploiting long-range correlation and local information for accurate monocular depth estimation," 2022, arXiv:2203.14211.
[43] A. Sabater, L. Montesano, and A. C. Murillo, "Event transformer. A sparse-aware solution for efficient event data processing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2022, pp. 2677–2686.
[44] W. Weng, Y. Zhang, and Z. Xiong, "Event-based video reconstruction using transformer," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 2543–2552.
[45] Y. Alkendi, R. Azzam, A. Ayyad, S. Javed, L. Seneviratne, and Y. Zweiri, "Neuromorphic camera denoising using graph neural network-driven transformers," IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 3, pp. 4110–4124, Mar. 2024.
[46] J. Zhang et al., "Spiking transformers for event-based single object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2022, pp. 8801–8810.
[47] J. Zhao, S. Zhang, and T. Huang, "Transformer-based domain adaptation for event data classification," in Proc. Int. Conf. Acoust., Speech, Signal Process., 2022, pp. 4673–4677.
[48] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128 × 128 120 dB 15 µs latency asynchronous temporal contrast vision sensor," IEEE J. Solid-State Circuits, vol. 43, no. 2, pp. 566–576, Feb. 2008.
[49] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck, "A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor," IEEE J. Solid-State Circuits, vol. 49, no. 10, pp. 2333–2341, Oct. 2014.
[50] J. Li, J. Li, L. Zhu, X. Xiang, T. Huang, and Y. Tian, "Asynchronous spatio-temporal memory network for continuous event-based object detection," IEEE Trans. Image Process., vol. 31, pp. 2975–2987, 2022.
[51] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, "Event-based vision meets deep learning on steering prediction for self-driving cars," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 5419–5427.
[52] D. Gehrig, A. Loquercio, K. Derpanis, and D. Scaramuzza, "End-to-end learning of representations for asynchronous event-based data," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5633–5643.
[53] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014, pp. 1–9.
[54] A. Paszke et al., "Automatic differentiation in PyTorch," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1–4.
[55] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2014, pp. 1–15.
[56] L. N. Smith and N. Topin, "Super-convergence: Very fast training of neural networks using large learning rates," Proc. SPIE, vol. 11006, pp. 369–386, May 2019.
[57] N. F. Y. Chen, "Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 644–653.
[58] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, "High speed and high dynamic range video with an event camera," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 6, pp. 1964–1980, Jun. 2021.
[59] S. Mehta and M. Rastegari, "MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer," in Proc. Int. Conf. Learn. Represent., 2021, pp. 1–26.
[60] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi, "Spatio-temporal backpropagation for training high-performance spiking neural networks," Front. Neurosci., vol. 12, p. 331, May 2018.
[61] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 652–660.
Xu Liu received the B.S. degree from the College of Computer Science and Technology, Harbin Engineering University (HEU), Harbin, China, in 2019. He is currently pursuing the Ph.D. degree with the Research Center of Intelligent Interface and Human Computer Interaction, Department of Computer Science and Technology, Harbin Institute of Technology (HIT), Harbin. His current research interests include neuromorphic vision, deep learning, 3D vision, and monocular depth estimation.

Jianing Li (Member, IEEE) received the B.S. degree from the College of Computer and Information Technology, China Three Gorges University, China, in 2014, the M.S. degree from the School of Microelectronics and Communication Engineering, Chongqing University, China, in 2017, and the Ph.D. degree from the National Engineering Research Center for Visual Technology, School of Computer Science, Peking University, Beijing, China, in 2022. He is the author or coauthor of over 20 technical papers in refereed journals and conferences, such as IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE TRANSACTIONS ON IMAGE PROCESSING, CVPR, ICCV, and AAAI. He was honored with the Lixin Tang Scholarship from Chongqing University, China, in 2016. He received the Outstanding Research Award, Peking University, in 2020. His research interests include event-based vision, neuromorphic engineering, and robot learning.

Yonghong Tian (Fellow, IEEE) is currently the Dean of the School of Electronics and Computer Engineering, a Boya Distinguished Professor with the School of Computer Science, Peking University, China, and the Deputy Director of Artificial Intelligence Research with the Peng Cheng Laboratory, Shenzhen, China. His research interests include neuromorphic vision, distributed machine learning, and multimedia big data. He is the author or coauthor of over 350 technical articles in refereed journals and conferences. He was/is an Associate Editor of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY from January 2018 to December 2021, IEEE TRANSACTIONS ON MULTIMEDIA from August 2014 to August 2018, IEEE Multimedia Magazine from January 2018 to August 2022, and IEEE ACCESS from January 2017 to December 2021. He co-initiated the IEEE International Conference on Multimedia Big Data (BigMM). He served as the TPC Co-Chair for BigMM 2015. He also served as the Technical Program Co-Chair for IEEE ICME 2015, IEEE ISM 2015, and IEEE MIPR 2018/2019, and the General Co-Chair for IEEE MIPR 2020 and ICME 2021. He is a TPC Member of more than ten conferences, such as CVPR, ICCV, ACM KDD, AAAI, ACM MM, and ECCV. He was a recipient of the Chinese National Science Foundation for Distinguished Young Scholars in 2018, two National Science and Technology Awards and three ministerial-level awards in China, and obtained the 2015 EURASIP Best Paper Award for EURASIP Journal on Image and Video Processing, the Best Paper Award of IEEE BigMM 2018, and the 2022 IEEE SA Standards Medallion and SA Emerging Technology Award. He is a Senior Member of CIE and CCF and a member of ACM.