Event-Based Monocular Depth Estimation With Recurrent Transformers
Manuscript received 21 March 2023; revised 24 July 2023 and 1 November 2023; accepted 5 March 2024. Date of publication 18 March 2024; date of current version 12 August 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFF0900500; and in part by the National Natural Science Foundation of China (NSFC) under Grant U22B2035, Grant 62272128, Grant 62027804, and Grant 62088102. This article was recommended by Associate Editor Z. Li. (Corresponding author: Xiaopeng Fan.)

Xu Liu, Xiaopeng Fan, and Debin Zhao are with the Research Center of Intelligent Interface and Human Computer Interaction, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with the Peng Cheng Laboratory, Shenzhen 518000, China (e-mail: [email protected]; [email protected]; [email protected]).

Jianing Li is with the School of Computer Science, Peking University, Beijing 100871, China (e-mail: [email protected]).

Jinqiao Shi is with the School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100871, China (e-mail: [email protected]).

Yonghong Tian is with the School of Computer Science, Peking University, Beijing 100871, China, and also with the Peng Cheng Laboratory, Shenzhen 518000, China (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSVT.2024.3378742.

Digital Object Identifier 10.1109/TCSVT.2024.3378742

I. INTRODUCTION

MONOCULAR depth estimation [1], [2], [3], [4] is one of the critical and challenging topics, supporting widespread vision applications in a low-cost and effective manner. In fact, conventional frame-based cameras have presented some shortcomings for depth estimation in challenging conditions (e.g., motion blur and low light) [5], [6]. Recently, event cameras [7], [8], offering high temporal resolutions and high dynamic ranges, have been explored to address these common challenges [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. However, a key question remains: How to effectively exploit the global spatial information and rich temporal cues from asynchronous sparse events to generate dense depth maps?

For spatial modeling, the mainstream event-based monocular depth estimators [10], [11], [13], [22] adopt CNN-based architectures. For instance, Zhu et al. [11] design an unsupervised CNN-based encoder-decoder network for semi-dense depth estimation. Further, the following works [10], [13], [22] present supervised training frameworks to generate dense depth maps based on UNet [23]. Although these CNN-based learning methods achieve better performance than the model-based optimized approaches [14], [15], [16], [19], they are not capable of utilizing the global spatial information from asynchronous sparse events due to the essential locality of convolution operations. For temporal modeling, most existing event-based monocular depth estimators [10], [22] introduce RNN-based architectures. More specifically, lightweight recurrent convolutional architectures (e.g., ConvLSTM [24] and ConvGRU [25]) are incorporated into UNet [23] to model long-range temporal dependencies.
However, these RNN-based architectures essentially still use convolution operations to model the interaction between spatial and temporal information, showing a limited capacity for effective temporal modeling. More recently, transformers [26], [27], [28] have demonstrated appealing potential in modeling global spatial context information for frame-based monocular depth estimation tasks [21], [29], [30], [31]. Moreover, transformers can effectively establish the interaction between the spatial and temporal domains via the self-attention mechanism, and they have demonstrated impressive performance in temporal sequence tasks [32], [33], [34].

To this end, this paper proposes an event-based monocular depth estimator with recurrent transformers, namely EReFormer, which is the first transformer-based architecture with a recursive mechanism to process continuous event streams, as shown in Fig. 1. Our EReFormer is designed to model global spatial information and long-range temporal dependencies from event streams. More specifically, we first design a transformer-based encoder-decoder backbone using swin transformer blocks [27] for event-based monocular depth estimation, which utilizes multi-scale features to model global spatial information from events. Then, we propose a Gate Recurrent Vision Transformer (GRViT) to leverage rich temporal cues from event streams. The core of GRViT is to incorporate a recursive mechanism into Vision Transformer (e.g., ViT [28]) so that it can model long-range temporal dependencies. Finally, we present a Cross-Attention-guided Skip Connection (CASC) to improve global spatial modeling capabilities in our EReFormer, which fuses multi-scale features by performing cross-attention. The experimental results demonstrate that our EReFormer outperforms state-of-the-art methods by a large margin on both synthetic and real-world datasets (i.e., DENSE [10] and MVSEC [35]). Our EReFormer also verifies that event cameras can perform robust monocular depth estimation even in cases where conventional cameras fail, e.g., fast-motion and low-light scenarios.

In summary, the main contributions are as follows:
• We propose a novel transformer-based architecture (i.e., EReFormer) for event-based monocular depth estimation, which outperforms state-of-the-art methods in terms of depth map quality by a large margin.
• We design a gate recurrent vision transformer incorporating a recursive mechanism into transformers, which enhances temporal modeling capabilities for event streams while mitigating the costly GPU memory requirement.
• We present a cross-attention-guided skip connection, which improves global spatial modeling capabilities via performing cross-attention to fuse multi-scale features.
To the best of our knowledge, this is the first work to explore such a recurrent transformer to generate dense depth maps using a monocular event camera, which further unveils the versatility and transferability of transformers from conventional frames to continuous event streams.

The rest of this paper is organized as follows. Section II reviews prior work. Section III formulates the novel problem. In Section IV, we explain the details of the proposed framework. The experimental results, ablations and analysis are provided in Section V, while some discussions are reported in Section VI. Finally, we conclude the paper in Section VII.

II. RELATED WORK

A. Event-Based Monocular Depth Estimation

Event cameras for monocular depth estimation have become increasingly popular in robot navigation [7], [36], [37], especially involving low-latency obstacle avoidance and high-speed path planning. Early model-based works [14], [15], [16], [17], [19] usually calculate both camera poses and depth maps via solving a non-linear optimization problem. Yet, these model-based optimized methods need to obtain camera poses or auxiliary sensor parameters (e.g., IMU). Recently, various learning-based methods [10], [11], [13], [22] have been introduced to convert asynchronous events into depth maps. Although these CNN-based methods achieve promising results, they insufficiently exploit global spatial information, and some of these feed-forward models [11], [13] have not yet used rich temporal cues from event streams. In addition, the lack of effective spatio-temporal information interactions in RNN-based backbones [10], [22] may limit performance improvements.

As illustrated in Table I, we make a comprehensive literature review on event-based monocular depth estimation. The existing event-based monocular depth estimators can be broadly classified into two categories (i.e., model-based optimized methods [14], [15], [17], [19] and learning-based methods [11], [13], [22]). Besides, the predicted density of depth models contains three types (i.e., sparse [14], [15], semi-dense [11], [19], and dense [10], [13], [17], [22]). The sparse map refers to the depth only at pixels where events occurred, the semi-dense map denotes the depth at the reconstructed edges of the image, and the dense map is the depth prediction at all pixels.

B. Transformer-Based Monocular Depth Estimation

Transformers are applied in frame-based monocular depth estimation tasks [21], [29], [38], [39], [40], [41], [42] by integrating the self-attention mechanism or the full transformer as a powerful module. For instance, DPT [21] first leverages vision transformers instead of CNN-based backbones for dense depth prediction tasks. Meanwhile, Swin-Depth [39] proposes a transformer-based monocular depth estimation method that uses hierarchical representation learning with linear complexity for images. Subsequently, Depthformer [42] presents a hybrid CNN-Transformer architecture consisting of a transformer branch to learn the long-range correlation and a convolution branch to extract the local information. In addition, some studies [29], [38], [40], [41] adopt transformers for self-supervised monocular depth estimation. Although the above works have achieved finer-grained and more globally coherent predictions than CNN-based methods, these transformer-based architectures operate on each isolated image, so they do not directly process a continuous stream of asynchronous events.
TABLE I
A LITERATURE REVIEW ON MONOCULAR EVENT-BASED DEPTH ESTIMATION
More recently, some event-based vision tasks (e.g., event representation [43], video reconstruction [44], event-based denoising [45], object tracking [46], and object recognition [47]) have sought to design transformer-based frameworks for better performance. For example, ET-Net [44] introduces transformers into CNN for event-based video reconstruction, which effectively models global context via the transformer-based module. Alkendi et al. [45] develop a hybrid GNN-Transformer model for event camera denoising. CTN [47] presents a hybrid CNN-Transformer network for event-based data classification. However, there are only a few explorations of event-based monocular depth estimation tasks. In this paper, we propose EReFormer, which is a pure transformer-based architecture to model global spatial context information and long-range temporal dependencies for event-based monocular depth estimation.

III. PROBLEM DEFINITION

Event cameras, such as DVS [48] and DAVIS [49], are bio-inspired vision sensors that respond to light changes with continuous event streams. Each event e_n can be described as a four-attribute tuple (x_n, y_n, t_n, p_n). Consequently, asynchronous events S = {e_n}_{n=1}^{N_e} are sparse and discrete points in the spatio-temporal window. In general, a continuous event stream needs to be split into event temporal bins. Obviously, the temporal correlation lies in adjacent event temporal bins [50]. However, most existing event-based monocular depth estimators [11], [13], running a feed-forward frame-based model independently on each event image [51] or voxel grid [11], have not yet leveraged rich temporal cues. In this work, we focus on this knowledge gap and formulate the challenging issue called event-based monocular depth estimation as follows.

Let {S_1, . . . , S_T} be event temporal bins separated from a continuous event stream S, where S_t ∈ R^{W×H×Δt} is the t-th event temporal bin with the duration Δt = 50 ms. To make asynchronous events compatible with deep learning techniques [52], event temporal bins need to be converted into event embeddings E = {E_1, . . . , E_T} by a kernel function K, where E_t ∈ R^{W×H×C_e} is the t-th event embedding with the channel number C_e. The goal of our monocular depth estimator is to learn a non-linear mapping function M to generate dense depth maps D = {D_1, . . . , D_T} by exploiting the spatio-temporal information, which can be formulated as:

D = M(K(S_1), . . . , K(S_T)),   (1)

where the proposed function M can leverage rich temporal cues from event temporal bins, and the parameter T determines the length of utilized temporal information.

Given the ground-truth depth maps D̄ = {D̄_1, . . . , D̄_T}, we minimize the loss function between the predicted depth map D_t and the ground truth D̄_t as follows:

M̂ = arg min_M L_M(D, D̄) ≜ E_{t∈[1,T]}[d(D_t, D̄_t)],   (2)

where E[·] is an empirical expectation function and d(·, ·) is a distance metric, e.g., the scale-invariant loss.
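To make the pre-processing in Eq. (1) concrete, the following minimal sketch (ours, not the authors' code) splits a raw event stream into Δt = 50 ms temporal bins and applies one simple instance of the kernel K, namely a per-polarity event-count image in the spirit of the event image representation [51]; the function names and the (N, 4) array layout are illustrative assumptions.

import numpy as np

def split_into_bins(events, delta_t_us=50_000):
    """Split an event stream, an (N, 4) array of (x, y, t, p) rows, into
    consecutive temporal bins S_1, ..., S_T of duration delta_t_us."""
    t0, t1 = events[:, 2].min(), events[:, 2].max()
    edges = np.arange(t0, t1 + delta_t_us, delta_t_us)
    return [events[(events[:, 2] >= lo) & (events[:, 2] < hi)]
            for lo, hi in zip(edges[:-1], edges[1:])]

def event_image(bin_events, height, width):
    """A simple kernel K: per-pixel event counts for each polarity,
    yielding a 2D image-like tensor E_t of shape (2, H, W)."""
    img = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _, p in bin_events:
        img[int(p > 0), int(y), int(x)] += 1.0
    return img

Any other representation (e.g., a voxel grid) can play the role of K, as noted in the framework overview below.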
IV. METHODOLOGY

This section first gives an overview of our framework. Then, we present the details of the three important components of the proposed framework: the transformer-based encoder-decoder, the gate recurrent vision transformer, and the cross-attention-guided skip connection. Finally, we give the training details of our method.

A. Framework Overview

This work aims at designing an event-based monocular depth estimator with recurrent transformers, termed EReFormer, which can generate high-quality dense depth maps via modeling global spatial context information and leveraging rich temporal cues. As shown in Fig. 2(a), our EReFormer mainly consists of three modules: the transformer-based encoder-decoder, the gate recurrent vision transformer (GRViT) module, and the cross attention-guided skip connection (CASC) module. More precisely, the event stream S is first split into event temporal bins {S_1, . . . , S_T}, and each bin S_t is converted into a 2D image-like representation E_t. To provide a more compelling demonstration of our proposed framework, we utilize the event image representation [51] to encode each bin. This choice is motivated by its ease of implementation and faster inference speed. In fact, our EReFormer framework offers a generic interface, allowing alternative event representations to be used as well, providing flexibility and adaptability to different scenarios. Then, the transformer-based encoder, utilizing swin transformer blocks [27], progressively extracts multi-scale features via the downsampling operation. Meanwhile, the GRViT incorporates a recursive mechanism into Vision Transformer (e.g., ViT [28]) to model long-range temporal dependencies, which can leverage rich temporal cues from event streams and alleviate the expensive GPU memory cost. To further improve global spatial modeling capabilities, the CASC is designed as a skip connection to fuse multi-scale features. Finally, the corresponding decoder predicts fine-grained and globally coherent depth maps {D_1, . . . , D_T} using the hierarchical upsampling transformer blocks.
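The overall workflow can be summarized by the short inference loop below. It is a sketch under assumed module interfaces (encoder, grvit, and casc_decoder are hypothetical names, not the released code); it only illustrates how the hidden state carries temporal cues across event temporal bins, as described above.

import torch

@torch.no_grad()
def predict_sequence(event_images, encoder, grvit, casc_decoder):
    """Recurrent inference over a sequence of event embeddings E_1..E_T.

    event_images: list of tensors (B, Ce, H, W); the hidden state h carries
    temporal cues from one event temporal bin to the next.
    """
    h = None                       # hidden state of the GRViT, h_0
    depth_maps = []
    for e_t in event_images:
        feats = encoder(e_t)       # multi-scale features; feats[-1] is the bottleneck f_t
        f_hat, h = grvit(feats[-1], h)          # Eq. (3): (f_hat_t, h_t) = G(f_t, h_{t-1})
        d_t = casc_decoder(f_hat, feats[:-1])   # decoder with cross-attention skip connections
        depth_maps.append(torch.sigmoid(d_t))   # normalized log-depth prediction in [0, 1]
    return depth_maps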
B. Transformer-Based Encoder-Decoder

Due to the sparse and discrete attributes of asynchronous events, it is difficult to extract effective global spatial information from the local space using CNN-based models. To overcome this challenge, we develop a transformer-based encoder-decoder that models global spatial information from event streams for monocular depth estimation.

1) Transformer Encoder: In order to enhance the global information learning ability under different scale features, we exploit the widely-used Swin-T [27] as our backbone, which utilizes the hierarchical attention mechanism to extract features. Specifically, a 2D image-like representation E_t ∈ R^{W×H×C_e} is first split into non-overlapping patches of size 4 × 4 and then projected to tokens with dimension C by a patch embedding layer. Furthermore, all tokens are input to four transformer layers with different block numbers (i.e., 2, 2, 6, and 2), and each transformer layer performs a downsampling operation that reduces the spatial resolution and increases the channel number by a factor of 2.

2) Transformer Decoder: As a symmetrical architecture, the corresponding decoder is also a hierarchical network with four transformer layers. In detail, each layer first increases the channel number and then decreases the spatial resolution via the patch-splitting operation. After that, the last transformer layer further refines the feature map d_t^0, and a task-specific head followed by the sigmoid function is implemented to predict a dense depth map D_t as the final output.
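For concreteness, the resolutions and channel widths of the multi-scale features can be computed as below; this is simple arithmetic under one common Swin-T-style layout (4 × 4 patch embedding, per-stage factor-2 downsampling, and the base dimension C = 96 reported in the implementation details), not the authors' code.

def encoder_feature_shapes(height, width, base_dim=96, num_stages=4):
    """Shapes of the multi-scale features for a Swin-T-like encoder:
    stage i outputs (H / (4 * 2^i), W / (4 * 2^i)) tokens with 2^i * C channels."""
    shapes = []
    h, w, c = height // 4, width // 4, base_dim   # after the 4x4 patch embedding
    for _ in range(num_stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2           # downsample, double the channels
    return shapes

# e.g., a 256x256 event image -> [(64, 64, 96), (32, 32, 192), (16, 16, 384), (8, 8, 768)]
print(encoder_feature_shapes(256, 256))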
C. Gate Recurrent Vision Transformer

Temporal transformers have achieved great success in various video sequence tasks [32], [33], efficiently modeling temporal dependencies in a parallel manner. Nevertheless, one problem is that these parallel-processing temporal transformers require a large GPU memory. Another problem is that the temporal information extracted from asynchronous events in batch mode is limited. To overcome these problems, we design a gate recurrent vision transformer (GRViT) that introduces a recursive mechanism into Vision Transformer (e.g., ViT [28]), which can improve temporal modeling capabilities for event streams while alleviating the expensive GPU memory cost.

The overview diagram of the proposed GRViT is shown in Fig. 2(b). For the current event temporal bin S_t, our GRViT G takes the feature map f_t and the hidden state h_{t−1} from the previous temporal bin as the input, then outputs the current hidden state h_t and the spatio-temporal feature map f̂_t, which can be formulated as:

(f̂_t, h_t) = G(f_t, h_{t−1}).   (3)

To be specific, our GRViT mainly consists of two core parts, namely the attention gate and the update gate. A learnable positional encoding vector needs to be appended to f_t before inputting it into the GRViT. The attention gate is utilized to generate the attention feature map A_t. Firstly, A_t is added to the input f_t, followed by a feed-forward network (FFN) with a residual connection, which outputs the spatio-temporal feature map f̂_t. Secondly, A_t and h_{t−1} are passed through the update gate, which outputs the current hidden state h_t.

The attention gate aims at establishing the interaction between the spatial and temporal domains from the current feature map and the previous hidden state. Firstly, the input of the attention gate is a triplet (i.e., Q_t, K_t, and V_t), which can be computed from f_t and h_{t−1} as:

Q_t = f_t W_Q^f + h_{t−1} W_Q^h,
K_t = f_t W_K^f + h_{t−1} W_K^h,
V_t = f_t W_V^f + h_{t−1} W_V^h,   (4)

where W_Q^f, W_K^f, W_V^f, W_Q^h, W_K^h, and W_V^h are learnable parameters of linear projection layers. Then, a linear attention operation replaces the SoftMax to prevent gradient vanishing, and it can be depicted as:

a_t = (elu(Q_t) + 1)((elu(K_t))^⊤ + 1) V_t,   (5)

where elu is the ELU activation function. Finally, the attention feature map A_t can be obtained by running m independent linear-attention operations and projecting their concatenated outputs as:

A_t = [a_t^1; . . . ; a_t^m] W_a,   (6)

where W_a denotes a linear layer that is used to project the attended vector.

As a result, the final output spatio-temporal feature map f̂_t can be formulated as:

f̂_t = A_t + f_t + FFN(A_t + f_t).   (7)

The update gate determines how much temporal information will be passed to the next time step. f_t and h_{t−1} are concatenated and passed to a linear projection layer followed by a sigmoid function to output the gate U_t, which can be expressed as:

U_t = σ([f_t; h_{t−1}] W_p),   (8)

where W_p refers to the linear projection layer and σ(·) indicates the sigmoid activation function.

In fact, U_t determines how much attended information to keep and how much temporal information in the previous hidden state to discard. Thus, the current hidden state h_t can be computed as follows:

h_t = (1 − U_t) ⊙ h_{t−1} + U_t ⊙ A_t.   (9)
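A minimal single-head PyTorch sketch of the GRViT recurrence in Eqs. (3)-(9) is given below. It follows the equations literally but omits the learnable positional encoding and the m-head extension of Eq. (6); the class and variable names are ours, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GRViTCell(nn.Module):
    """Gate recurrent vision transformer cell: (f_hat_t, h_t) = G(f_t, h_{t-1})."""

    def __init__(self, dim):
        super().__init__()
        # Eq. (4): separate projections of the current feature f_t and hidden state h_{t-1}
        self.q_f, self.q_h = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.k_f, self.k_h = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.v_f, self.v_h = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.proj_a = nn.Linear(dim, dim)      # W_a in Eq. (6), single head here
        self.update = nn.Linear(2 * dim, dim)  # W_p in Eq. (8)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f_t, h_prev=None):
        # f_t: (B, N, D) tokens of the current event temporal bin
        if h_prev is None:
            h_prev = torch.zeros_like(f_t)
        q = self.q_f(f_t) + self.q_h(h_prev)               # Eq. (4)
        k = self.k_f(f_t) + self.k_h(h_prev)
        v = self.v_f(f_t) + self.v_h(h_prev)
        q, k = F.elu(q) + 1.0, F.elu(k) + 1.0              # Eq. (5): elu(.)+1 feature map
        a_t = q @ (k.transpose(-2, -1) @ v)                # linear attention, no SoftMax
        a_t = self.proj_a(a_t)                             # Eq. (6) with m = 1
        f_hat = a_t + f_t + self.ffn(a_t + f_t)            # Eq. (7)
        u_t = torch.sigmoid(self.update(torch.cat([f_t, h_prev], dim=-1)))  # Eq. (8)
        h_t = (1.0 - u_t) * h_prev + u_t * a_t             # Eq. (9)
        return f_hat, h_t

In the full module, Eq. (6) runs m such linear-attention operations in parallel and projects their concatenation with W_a.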
D. Cross Attention-Guided Skip Connection

Most event-based monocular depth estimators [10], [11], [13] adopt an aggregation operation (e.g., ADD or CONCAT) as a skip connection to fuse multi-scale features.
Fig. 2. The structure of the proposed event-based monocular depth estimator with recurrent transformers. (a) The overall workflow of our EReFormer. The
event stream is first converted into event embedding [52] and then split into non-overlapping patches. Then, the patches are processed via an encoder-decoder
sub-network with transformer blocks. (b) The proposed GRViT incorporates a recursive mechanism into transformers to leverage temporal cues. (c) The
designed CASC module is presented as a skip connection to fuse multi-scale features.
However, these fusion strategies insufficiently exploit the global spatial context information from sparse asynchronous events. Thus, we propose a cross attention-guided skip connection (CASC) to overcome this problem via cross-attention learning.

Our CASC module mainly consists of two core transformer blocks, namely regular window-based and shifted window-based multi-head self-attention (i.e., WMSA and SWMSA [27]). As illustrated in Fig. 2(c), WMSA and SWMSA each perform the cross-attention operation with a residual connection. Taking WMSA as an example, we use the decoded feature map d_t to generate the query (Q_t), and utilize the output f̂_t of GRViT to generate the key (K_t) and value (V_t). Taking the triplet (i.e., Q_t, K_t, and V_t) as the input, our CASC module progressively models spatial contextual information and outputs the cross-attention feature map d̄_t. Finally, the fused feature map d̂_t is obtained by a residual connection that integrates d_t and d̄_t. Thus, our CASC module can be formulated as follows:

d̃_t = WMSA(d_t, f̂_t) + FFN(WMSA(d_t, f̂_t)),   (10)

The normalized log-depth prediction is converted into a metric depth map D_{m,t}, where D_max is the maximum observed depth and ϵ is used to map the minimum observed depth to 0. In our experiments, D_max = 80 and ϵ = 3.7.

For training losses, we use the scale-invariant loss [53], which is defined as:

L_{t,si} = (1/n) Σ_i (R_t(i))^2 − (λ/n^2) (Σ_i R_t(i))^2,   (12)

where R_t = D̄_t − D_t, λ = 0.85, and n is the number of valid ground-truth pixels i. Following the practice of E2Depth [20], we also use a multi-scale scale-invariant gradient matching loss L_{t,grad} that encourages smooth depth changes and enforces sharp depth discontinuities in the depth map prediction. Finally, the resulting total loss for a sequence of length L is:

L_tot = Σ_{t=0}^{L−1} (α L_{t,si} + β L_{t,grad}).   (13)
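A compact sketch of the training objective in Eqs. (12)-(13) is shown below. The scale-invariant term follows Eq. (12) directly; the multi-scale gradient matching term and the weights alpha and beta are written in a generic E2Depth/MegaDepth style and should be read as assumptions rather than the exact form used here.

import torch

def scale_invariant_loss(pred, target, valid, lam=0.85):
    """Eq. (12): R_t is ground truth minus prediction over valid pixels."""
    r = (target - pred)[valid]
    n = r.numel()
    return (r ** 2).sum() / n - lam * (r.sum() ** 2) / (n ** 2)

def grad_matching_loss(pred, target, valid, num_scales=4):
    """Multi-scale scale-invariant gradient matching term (assumed form):
    L1 differences of log-depth residual gradients at several scales."""
    loss = 0.0
    r = torch.where(valid, target - pred, torch.zeros_like(pred))
    for s in range(num_scales):
        step = 2 ** s
        dx = (r[..., :, step:] - r[..., :, :-step]).abs()
        dy = (r[..., step:, :] - r[..., :-step, :]).abs()
        loss = loss + dx.mean() + dy.mean()
    return loss

def sequence_loss(preds, targets, valids, alpha=1.0, beta=0.5):
    """Eq. (13): total loss over a sequence of length L (alpha, beta assumed)."""
    return sum(alpha * scale_invariant_loss(p, t, v) + beta * grad_matching_loss(p, t, v)
               for p, t, v in zip(preds, targets, valids))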
TABLE II
QUANTITATIVE RESULTS ON THE MVSEC DATASET
Fig. 3. Representative examples of four test sequences in the MVSEC dataset. The first row to the fourth row corresponds to the outdoor day1, outdoor
night1, outdoor night2, and outdoor night3, respectively. The second column refers to the MegaDepth [20] prediction using the APS frames. Note that,
MegaDepth fails to predict the fine-grained depth map at low-light conditions. Compared with E2Depth+ [10], our EReFormer can achieve more globally
coherent predictions both day and night, which is closer to the ground truth.
Town10 for testing. For the MVSEC dataset, we use outdoor day2 for training and four sequences (i.e., outdoor day1 and outdoor night1 to outdoor night3) for testing.

2) Evaluation Metrics: To compare different methods, the absolute relative error (Abs.Rel.), logarithmic root mean squared error (RMSELog), scale-invariant logarithmic error (SILog), accuracy (δ < 1.25^n, n = 1, 2, 3), average absolute depth errors at different cut-off depth distances (i.e., 10 m, 20 m, and 30 m), and running time (ms) are selected as six typical evaluation metrics, which are the most broadly utilized in the depth estimation task.
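For reference, the metrics listed above can be computed as in the following generic sketch over valid ground-truth pixels; the thresholds and averaging conventions are standard ones and may differ in detail from the evaluation code actually used.

import torch

def depth_metrics(pred, gt, valid):
    """Abs.Rel., RMSELog, SILog and threshold accuracies over valid pixels (depths > 0)."""
    p, g = pred[valid], gt[valid]
    abs_rel = ((p - g).abs() / g).mean()
    d = torch.log(p) - torch.log(g)
    rmse_log = torch.sqrt((d ** 2).mean())
    silog = torch.sqrt((d ** 2).mean() - d.mean() ** 2)
    ratio = torch.max(p / g, g / p)
    acc = {f"delta<1.25^{n}": (ratio < 1.25 ** n).float().mean() for n in (1, 2, 3)}
    return {"AbsRel": abs_rel, "RMSELog": rmse_log, "SILog": silog, **acc}

def cutoff_abs_error(pred, gt, valid, max_depth=10.0):
    """Average absolute depth error restricted to ground truth below a cut-off (10/20/30 m)."""
    mask = valid & (gt < max_depth)
    return (pred[mask] - gt[mask]).abs().mean()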
3) Implementation Details: Our EReFormer is implemented using the PyTorch framework [54]. We use Swin-T [27] pre-trained on ImageNet as the backbone to achieve an accuracy-speed trade-off. We set the channel number C to 96. During training, we use the AdamW optimizer [55] with weight decay 0.1 and adopt the 1-cycle policy [56] for the learning rate with max_lr = 3.2 × 10^−5. We train our network for 200 epochs with batch size 2. Further, we use truncated backpropagation through time (TBPTT) in training to prevent gradient vanishing or exploding, and unroll the sequence by 16 steps due to memory limitations. All experiments are conducted on NVIDIA Tesla V100-PCIE GPUs.
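The training recipe above can be sketched as a truncated-backpropagation-through-time loop: gradients flow within each 16-step subsequence, and the hidden state is detached before the next one. The optimizer and schedule follow the reported settings, while the model, criterion, and dataloader interfaces are placeholders.

import torch

def train_epoch(model, loader, optimizer, scheduler, criterion, unroll=16, device="cuda"):
    """One epoch of TBPTT training: backpropagate within each 16-step subsequence,
    then detach the hidden state so gradients do not flow across subsequences."""
    model.train()
    for sequence in loader:                          # sequence: list of (event_image, gt_depth)
        hidden = None
        for start in range(0, len(sequence), unroll):
            loss = 0.0
            for event_image, gt_depth in sequence[start:start + unroll]:
                pred, hidden = model(event_image.to(device), hidden)
                loss = loss + criterion(pred, gt_depth.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            hidden = hidden.detach()                 # truncate the gradient path through time

# Optimizer and schedule as reported in the text:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3.2e-5, weight_decay=0.1)
# scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=3.2e-5,
#                                                 total_steps=num_training_steps)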
TABLE III
QUANTITATIVE RESULTS ON THE DENSE DATASET
Fig. 4. Representative examples of the testing sequence in the DENSE dataset. Obviously, our EReFormer obtains finer-grained and more globally coherent
dense depth maps than the best event-based competitor that utilizes E2Depth+ [10] to process the event stream.
4) Comparisons: To verify the effectiveness of the proposed approach, we compare our EReFormer with four state-of-the-art methods (i.e., E2Depth [10] for the voxel grid, DTL− [13] for the event image, E2Depth+ [10] for the voxel grid, and DPT [21] for the event image). It should be noted that E2Depth+ is pretrained on the first 1000 samples of the DENSE dataset and then retrained on both datasets, sharing the same architecture as E2Depth. DTL− selects one branch of the standard DTL [13] to convert each event image into a depth map. DPT is an outstanding frame-based monocular depth estimator that adopts vision transformers to process each event image. To be fair, we evaluate DTL− and the DPT architecture in the same experimental settings as our approach.

B. Main Experiments

1) Evaluation on the MVSEC Dataset: As illustrated in Table II, we quantitatively compare our EReFormer with four state-of-the-art methods on the MVSEC dataset [35]. All networks predict depth in the logarithmic scale, which is normalized and restored to absolute values by multiplying by the maximum depth clipped at 80 m. Note that our EReFormer achieves the best performance across the whole test sets, especially on the most valuable metric (i.e., Abs.Rel.). At the same time, we can see that DPT [21], using vision transformers, obtains better performance than the best CNN-based method E2Depth+ [10], which proves that utilizing the global spatial information from sparse events helps predict more accurate depth maps in different scenarios. Although DPT has achieved satisfactory results for event-based monocular dense depth estimation, it is sub-optimal due to not leveraging rich temporal cues from continuous event streams. In terms of the average absolute depth error at the 10 m, 20 m, and 30 m cut-offs, our EReFormer achieves more accurate depth predictions at all distances, with an average improvement over all test sequences of 14.8% at 10 m, 15.1% at 20 m, and 9.4% at 30 m with respect to DPT. In addition, our EReFormer is almost comparable to DPT in computational speed. Overall, it can be concluded that efficient global sparse spatial modeling and temporal utilization can improve the performance of event-based monocular depth estimation. We further present some visualization results on the MVSEC dataset in Fig. 3. Our EReFormer shows apparent advantages in HDR scenes, where the APS frames (the second column) fail to yield correct depth information in low-light conditions. Compared with E2Depth+, even though it was trained on both datasets, our EReFormer
TABLE IV
PERFORMANCE COMPONENTS OF OUR EREFORMER
TABLE V
COMPARISON WITH TYPICAL SKIP CONNECTION STRATEGIES
TABLE VI
ABLATING HIDDEN STATE TRANSFER IN THE GRVIT
TABLE VII
COMPARISON OF USING VARIOUS EVENT REPRESENTATIONS IN EREFORMER
TABLE VIII
COMPARISON WITH VARIOUS ENCODER BACKBONES
TABLE IX
GPU MEMORY COST ANALYSIS
TABLE XI
THE PARAMETERS AND GFLOPS OF DIFFERENT METHODS
Fig. 6. Representative visualization results on continuous sequences of the MVSEC dataset. Compared with the feed-forward baseline without GRViT (i.e.,
w/o GRViT), our EReFormer (i.e., w GRViT) performs better and obtains temporal consistent estimation results.
Fig. 7. Representative examples of three motion blur scenarios. The second column refers to the MegaDepth [20] prediction using the blurred APS frames.
Note that, MegaDepth fails to predict the fine-grained depth map at motion blur conditions. Compared with E2Depth+ [10], our EReFormer can achieve more
globally coherent predictions even in motion blur scenarios, closer to the ground truth.
from the previous subsequence serving as the initial hidden state for the next subsequence. We also conduct experiments to investigate the impact of using subsequences of different lengths for training EReFormer. We adjust the subsequence length to 4, 8, 12, and 16 to evaluate performance. As shown in Table XII, longer subsequences yield better performance. Consequently, due to memory limitations, we set the subsequence length to 16 in EReFormer training.

D. Scalability Experiments

This subsection first presents the visualization of the temporal modeling operation, and then provides some representative examples in motion blur scenarios. Finally, we analyze some failure cases of our EReFormer.

1) Visualization of Temporal Modeling: As shown in Fig. 6, we present some comparative visualization results with and without utilizing rich temporal cues. The feed-forward baseline, using a single event temporal bin, suffers from some failure cases involving buildings, as shown in the first row of Fig. 6. Fortunately, our EReFormer overcomes these problems by leveraging rich temporal cues from the continuous event stream and produces temporally consistent estimation results.

2) Representative Examples in Motion Blur Scenarios: We further present some visualization results in Fig. 7. RGB frames fail to support fine-grained depth prediction in high-speed motion blur scenarios. Much to our surprise, our EReFormer, inheriting the high temporal resolution and HDR properties of DVS events, performs robust depth estimation in these challenging scenarios. In other words, event cameras can perform robust monocular depth estimation even in cases where conventional cameras fail, e.g., fast-motion and low-light scenarios.

3) Failure Case Analysis: Although our EReFormer achieves satisfactory results even in challenging scenes, some failure cases still remain. As depicted in Fig. 8, the first and third columns show that it is hard to perform high-quality depth prediction in extremely slow-moving scenes. This is because event cameras evidently sense dynamic changes, but they
[15] H. Rebecq, G. Gallego, E. Mueggler, and D. Scaramuzza, "EMVS: Event-based multi-view stereo—3D reconstruction with an event camera in real-time," Int. J. Comput. Vis., vol. 126, no. 12, pp. 1394–1414, Dec. 2018.
[16] G. Gallego, M. Gehrig, and D. Scaramuzza, "Focus is all you need: Loss functions for event-based vision," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12272–12281.
[17] M. Cui, Y. Zhu, Y. Liu, Y. Liu, G. Chen, and K. Huang, "Dense depth-map estimation based on fusion of event camera and sparse LiDAR," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–11, 2022.
[18] H. Cho, J. Jeong, and K.-J. Yoon, "EOMVS: Event-based omnidirectional multi-view stereo," IEEE Robot. Autom. Lett., vol. 6, no. 4, pp. 6709–6716, Oct. 2021.
[19] H. Kim, S. Leutenegger, and A. J. Davison, "Real-time 3D reconstruction and 6-DoF tracking with an event camera," in Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VI. Springer, 2016, pp. 349–364.
[20] Z. Li and N. Snavely, "MegaDepth: Learning single-view depth prediction from internet photos," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2041–2050.
[21] R. Ranftl, A. Bochkovskiy, and V. Koltun, "Vision transformers for dense prediction," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 12179–12188.
[22] D. Gehrig, M. Rüegg, M. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza, "Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction," IEEE Robot. Autom. Lett., vol. 6, no. 2, pp. 2822–2829, Apr. 2021.
[23] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III. Springer, 2015, pp. 234–241.
[24] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Proc. Adv. Neural Inf. Process. Syst., vol. 28, 2015, pp. 1–9.
[25] N. Ballas, L. Yao, C. Pal, and A. Courville, "Delving deeper into convolutional networks for learning video representations," 2015, arXiv:1511.06432.
[26] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[27] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012–10022.
[28] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[29] V. Guizilini, R. Ambruş, D. Chen, S. Zakharov, and A. Gaidon, "Multi-frame self-supervised depth with transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 160–170.
[30] M. S. Junayed, A. Sadeghzadeh, M. B. Islam, L.-K. Wong, and T. Aydin, "HiMODE: A hybrid monocular omnidirectional depth estimation model," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2022, pp. 5208–5217.
[31] Y. Wang, J. Li, L. Zhu, X. Xiang, T. Huang, and Y. Tian, "Learning stereo depth estimation with bio-inspired spike cameras," in Proc. IEEE Conf. Multimedia Express, Jul. 2022, pp. 1–6.
[32] M. Cao, Y. Fan, Y. Zhang, J. Wang, and Y. Yang, "VDTR: Video deblurring with transformer," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 160–171, Jan. 2023.
[33] J. Li et al., "Video semantic segmentation via sparse temporal transformer," in Proc. ACM Int. Conf. Multimedia, 2021, pp. 59–68.
[34] J. Yang, X. Dong, L. Liu, C. Zhang, J. Shen, and D. Yu, "Recurring the transformer for video action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2022, pp. 14063–14073.
[35] A. Z. Zhu, D. Thakur, T. Ozaslan, B. Pfrommer, V. Kumar, and K. Daniilidis, "The multivehicle stereo event camera dataset: An event camera dataset for 3D perception," IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 2032–2039, Jul. 2018.
[36] D. Falanga, K. Kleber, and D. Scaramuzza, "Dynamic obstacle avoidance for quadrotors with event cameras," Sci. Robot., vol. 5, no. 40, Mar. 2020, Art. no. eaaz9712.
[37] A. Mitrokhin, P. Sutor, C. Fermüller, and Y. Aloimonos, "Learning sensorimotor control with neuromorphic sensors: Toward hyperdimensional active perception," Sci. Robot., vol. 4, no. 30, May 2019, Art. no. eaaw6736.
[38] C. Zhao et al., "MonoViT: Self-supervised monocular depth estimation with a vision transformer," in Proc. Int. Conf. 3D Vis. (3DV), Sep. 2022, pp. 668–678.
[39] Z. Cheng, Y. Zhang, and C. Tang, "Swin-depth: Using transformers and multi-scale fusion for monocular-based depth estimation," IEEE Sensors J., vol. 21, no. 23, pp. 26912–26920, Dec. 2021.
[40] S. Hwang, S. Park, J. Baek, and B. Kim, "Self-supervised monocular depth estimation using hybrid transformer encoder," IEEE Sensors J., vol. 22, no. 19, pp. 18762–18770, Oct. 2022.
[41] D. Han, J. Shin, N. Kim, S. Hwang, and Y. Choi, "TransDSSL: Transformer based depth estimation via self-supervised learning," IEEE Robot. Autom. Lett., vol. 7, no. 4, pp. 10969–10976, Oct. 2022.
[42] Z. Li, Z. Chen, X. Liu, and J. Jiang, "DepthFormer: Exploiting long-range correlation and local information for accurate monocular depth estimation," 2022, arXiv:2203.14211.
[43] A. Sabater, L. Montesano, and A. C. Murillo, "Event transformer. A sparse-aware solution for efficient event data processing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2022, pp. 2677–2686.
[44] W. Weng, Y. Zhang, and Z. Xiong, "Event-based video reconstruction using transformer," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 2543–2552.
[45] Y. Alkendi, R. Azzam, A. Ayyad, S. Javed, L. Seneviratne, and Y. Zweiri, "Neuromorphic camera denoising using graph neural network-driven transformers," IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 3, pp. 4110–4124, Mar. 2024.
[46] J. Zhang et al., "Spiking transformers for event-based single object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2022, pp. 8801–8810.
[47] J. Zhao, S. Zhang, and T. Huang, "Transformer-based domain adaptation for event data classification," in Proc. Int. Conf. Acoust., Speech, Signal Process., 2022, pp. 4673–4677.
[48] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128 × 128 120 dB 15 µs latency asynchronous temporal contrast vision sensor," IEEE J. Solid-State Circuits, vol. 43, no. 2, pp. 566–576, Feb. 2008.
[49] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck, "A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor," IEEE J. Solid-State Circuits, vol. 49, no. 10, pp. 2333–2341, Oct. 2014.
[50] J. Li, J. Li, L. Zhu, X. Xiang, T. Huang, and Y. Tian, "Asynchronous spatio-temporal memory network for continuous event-based object detection," IEEE Trans. Image Process., vol. 31, pp. 2975–2987, 2022.
[51] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, "Event-based vision meets deep learning on steering prediction for self-driving cars," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 5419–5427.
[52] D. Gehrig, A. Loquercio, K. Derpanis, and D. Scaramuzza, "End-to-end learning of representations for asynchronous event-based data," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5633–5643.
[53] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014, pp. 1–9.
[54] A. Paszke et al., "Automatic differentiation in PyTorch," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1–4.
[55] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2014, pp. 1–15.
[56] L. N. Smith and N. Topin, "Super-convergence: Very fast training of neural networks using large learning rates," Proc. SPIE, vol. 11006, pp. 369–386, May 2019.
[57] N. F. Y. Chen, "Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 644–653.
[58] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, "High speed and high dynamic range video with an event camera," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 6, pp. 1964–1980, Jun. 2021.
[59] S. Mehta and M. Rastegari, "MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer," in Proc. Int. Conf. Learn. Represent., 2021, pp. 1–26.
[60] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi, "Spatio-temporal backpropagation for training high-performance spiking neural networks," Front. Neurosci., vol. 12, p. 331, May 2018.
[61] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 652–660.
Xu Liu received the B.S. degree from the College of Computer Science and Technology, Harbin Engineering University (HEU), Harbin, China, in 2019. He is currently pursuing the Ph.D. degree with the Research Center of Intelligent Interface and Human Computer Interaction, Department of Computer Science and Technology, Harbin Institute of Technology (HIT), Harbin. His current research interests include neuromorphic vision, deep learning, 3D vision, and monocular depth estimation.

Jianing Li (Member, IEEE) received the B.S. degree from the College of Computer and Information Technology, China Three Gorges University, China, in 2014, the M.S. degree from the School of Microelectronics and Communication Engineering, Chongqing University, China, in 2017, and the Ph.D. degree from the National Engineering Research Center for Visual Technology, School of Computer Science, Peking University, Beijing, China, in 2022. He is the author or coauthor of over 20 technical papers in refereed journals and conferences, such as IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE TRANSACTIONS ON IMAGE PROCESSING, CVPR, ICCV, and AAAI. He was honored with the Lixin Tang Scholarship from Chongqing University, China, in 2016. He received the Outstanding Research Award, Peking University, in 2020. His research interests include event-based vision, neuromorphic engineering, and robot learning.

Yonghong Tian (Fellow, IEEE) is currently the Dean of the School of Electronics and Computer Engineering, a Boya Distinguished Professor with the School of Computer Science, Peking University, China, and the Deputy Director of Artificial Intelligence Research with the Peng Cheng Laboratory, Shenzhen, China. His research interests include neuromorphic vision, distributed machine learning, and multimedia big data. He is the author or coauthor of over 350 technical articles in refereed journals and conferences. He was/is an Associate Editor of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY from January 2018 to December 2021, IEEE TRANSACTIONS ON MULTIMEDIA from August 2014 to August 2018, IEEE Multimedia Magazine from January 2018 to August 2022, and IEEE ACCESS from January 2017 to December 2021. He co-initiated the IEEE International Conference on Multimedia Big Data (BigMM). He served as the TPC Co-Chair for BigMM 2015. He also served as the Technical Program Co-Chair for IEEE ICME 2015, IEEE ISM 2015, and IEEE MIPR 2018/2019, and the General Co-Chair for IEEE MIPR 2020 and ICME 2021. He is a TPC Member of more than ten conferences, such as CVPR, ICCV, ACM KDD, AAAI, ACM MM, and ECCV. He was a recipient of the Chinese National Science Foundation for Distinguished Young Scholars in 2018, two National Science and Technology Awards and three ministerial-level awards in China, and obtained the 2015 EURASIP Best Paper Award for EURASIP Journal on Image and Video Processing, the Best Paper Award of IEEE BigMM 2018, and the 2022 IEEE SA Standards Medallion and SA Emerging Technology Award. He is a Senior Member of CIE and CCF and a member of ACM.