StreamMOS: Streaming Moving Object Segmentation with Multi-View Perception and Dual-Span Memory

Zhiheng Li, Yubo Cui, Jiexi Zhong, Zheng Fang

This work was supported by the National Natural Science Foundation of China under Grant 62073066, the Fundamental Research Funds for Central Universities under Grant N2226001, and the 111 Project under Grant B16009. (Corresponding author: Zheng Fang, e-mail: [email protected]) The authors are all with the Faculty of Robot Science and Engineering, Northeastern University, Shenyang, China.

Abstract

Moving object segmentation based on LiDAR is a crucial and challenging task for autonomous driving and mobile robotics. Most approaches explore spatio-temporal information from LiDAR sequences to predict moving objects in the current frame. However, they often focus on transferring temporal cues within a single inference and regard every prediction as independent of the others. This may cause inconsistent segmentation results for the same object in different frames. To overcome this issue, we propose a streaming network with a memory mechanism, called StreamMOS, to build the association of features and predictions among multiple inferences. Specifically, we utilize a short-term memory to convey historical features, which can be regarded as a spatial prior of moving objects and are adopted to enhance the current inference by temporal fusion. Meanwhile, we build a long-term memory to store previous predictions and exploit them to refine the present forecast at the voxel and instance levels through voting. Besides, we present a multi-view encoder with cascade projection and asymmetric convolution to extract motion features of objects in different representations. Extensive experiments validate that our algorithm achieves competitive performance on the SemanticKITTI and Sipailou Campus datasets. Code will be released at https://github.com/NEU-REAL/StreamMOS.git.

I INTRODUCTION

On urban roads, there are often many dynamic objects with variable trajectories, such as vehicles and pedestrians, which create collision risks for autonomous vehicles. Meanwhile, moving objects cause errors in simultaneous localization and mapping (SLAM) [1] and pose challenges for obstacle avoidance [2] and path planning [3]. As a result, online moving object segmentation (MOS) based on LiDAR points has become a crucial task in multiple fields. However, owing to the unordered and sparse nature of LiDAR points, MOS still faces challenging cases, especially perceiving distant moving objects that are covered by only a few points.

To tackle the above problem, the mainstream strategy is to exploit spatio-temporal information from LiDAR sequences. For instance, Chen et al. [4] generate residual images in range view (RV), which reflect the spatial position of dynamic objects in each frame and can be utilized to perform temporal fusion to predict moving objects. Following the RV-based projection in [4], Sun et al. [5] adopt motion-guided attention to better explore temporal motion cues from residual images. Besides, some works [6, 7] attempt to map point clouds onto the bird’s eye view (BEV) to ensure consistent object size and movement. Recently, Wang et al. [8] process LiDAR sequences directly via 4D convolution to construct temporal associations while adding instance detection to promote segmentation integrity.

Figure 1: Pipeline comparison of moving object segmentation approaches. We compare the structure of proposed StreamMOS with previous methods in (a) and (b). Meanwhile, the segmentation results obtained by our method achieve better spatial integrity and temporal continuity in (c).

However, as displayed in Fig. 1(a), these methods focus on temporal fusion in a single inference and make independent predictions for each frame, leading to inconsistent results for the same object at different moments (Fig. 1(c)). Although Mersch et al. [9] leverage a binary Bayes filter to combine multiple predictions, their method still ignores information transmission at the feature level, which supplies rich spatial context to the next inference. Thus, we present a “streaming” structure as shown in Fig. 1(b), which regards the historical feature as a strong prior and exploits it to guide the current inference. Meanwhile, the past predictions are stored in long-term memory and utilized to suppress false predictions. In this way, we construct robust correlations across multiple inferences and fully explore temporal information to ensure consistent results in different frames.

To implement the idea of streaming, we propose a moving object segmenter, called StreamMOS, which encodes object motion cues from multiple views and adopts dual-span memory to transfer historical information. Specifically, different from previous works that map point clouds onto one view, we argue that various viewpoints provide more holistic observations of dynamic objects. Thus, we propose a multi-view encoder that applies a cascade structure to iteratively get dense appearance from RV and perceive intuitive motion on BEV, resulting in more distinguishable features of dynamic objects (Fig. 3(b)). Meanwhile, during BEV encoding, we introduce asymmetric convolution with a decoupled strategy to better capture vertical and horizontal motion. Then, we use an attention mechanism to implement temporal fusion that aligns features from different times and conveys a spatial prior to the current inference. Besides, due to the inherent uncertainty of neural networks, the output of the segmentation decoder may be inconsistent across frames (Fig. 1(c)). To solve this issue, we propose a voting mechanism as post-processing to optimize predicted labels. Its core idea is to statistically analyze long-term motion states at the voxel and instance levels, and then select the most likely state to update the raw point-wise forecasts. In this way, the previous results can be used to refine current predictions, enhancing both the temporal continuity and the spatial completeness of segmentation.

In summary, the contributions of our work are as follows:

  • We present a novel streaming framework called StreamMOS, which exploits short-term and long-term memory to construct associations among inferences and improve the integrity and continuity of predictions in MOS task.

  • We propose a multiple projection architecture to capture the object motion and complete appearance from multi-view. We also present a multi-level voting mechanism to refine segmentation results for every voxel and instance.

  • The extensive experiments confirm that our StreamMOS outperforms previous algorithms on the SemanticKITTI (77.8%) and Sipailou Campus (92.5%) while running in real-time. Our code will be available to the community.

II RELATED WORK

II-A Geometric-based Algorithms

The initial LiDAR-based MOS methods can be referred to as geometric-based approaches, which typically build the map in advance and remove dynamic objects by estimating occupancy probability and determining visibility. For example, Schauer et al. [10] proposed a ray casting-based approach that counted the hits and misses of scans to update the occupancy situation of the grid map. Afterwards, Pagad et al. [11] constructed an occupancy octree map and proposed a probability update mechanism to obtain clean point clouds by considering the occupancy history. Despite getting promising results, [10, 11] suffer from a heavy computational burden due to ray casting and voxel-by-voxel updates. To improve efficiency, several visibility-based algorithms [12, 13, 14] have been developed. Pomerleau et al. [12] identified moving objects by checking whether the points of the pre-built map are occluded by the points in the query frame. Meanwhile, to avoid mislabeling ground points as dynamic, as reported in [12], Kim et al. [13] retained ground points from removed points using a multi-resolution reverting algorithm. Moreover, Lim et al. [15] introduced a visibility-free approach that removed moving traces by computing the pseudo occupancy ratio between the query scan and the submap in each grid. Although the above methods can distinguish the motion state of objects and clean maps well, they are often performed offline because they require a prior map, and may not be suitable for real-time applications.

II-B Learning-based Algorithms

Recently, many studies have focused on utilizing learning-based approaches to eliminate dynamic objects online, which only take consecutive frame point clouds as input rather than a pre-built map. Meanwhile, according to data representation, these approaches could be grouped into projection-based and point-based methods. The former converts point clouds into bird’s eye view (BEV) or range view (RV) images, while the latter processes 3D raw points directly.

Specifically, for point-based algorithms, Mersch et al. [9] adopted sparse 4D convolutions to process a series of LiDAR scans and predicted moving objects in each frame. They also employed a binary Bayes filter to fuse multiple predictions in a sliding window. Subsequently, Kreutz et al. [16] proposed an unsupervised approach to address MOS task in stationary LiDAR and viewed it as a multivariate time series clustering problem. Lately, Wang et al. [8] introduced InsMOS to unify detection and segmentation of moving objects into a network, so that the instance cues can be used to improve segmentation integrity. Although they achieved promising performance, the feature extraction of numerous points in [8] may cause high computational costs.

Compared to the mentioned approaches, projection-based algorithms [4, 5, 17, 6, 7] are generally more efficient owing to handling ordered and dense data. For instance, Chen et al. [4] mapped LiDAR scans into spherical coordinates and generated residual images to extract dynamic information in sequence. Sun et al. [5] designed a dual-branch to explore the spatial-temporal information and relieved boundary blurring problem by a point refinement module. Furthermore, Kim et al. [17] achieved higher performance by using extra semantic features. In contrast to range projection, Mohapatra et al. [6] and Zhou et al. [7] utilized BEV projection to obtain a more intuitive motion representation but the serious loss of spatial information still limited their performance. Thus, to address this issue, our StreamMOS exploits a multi-view encoder to capture object motion from BEV and RV in a series manner, which not only allows for complete observation of objects but also alleviates computational effort. Meanwhile, we construct memory banks to pass past knowledge to current inference, resulting in consistent segmentation across a long sequence.

Figure 2: The overall architecture of StreamMOS. (a) Feature encoder adopts a point-wise encoder to extract point features and project them into BEV. Then, the multi-view encoder with cascade structure and asymmetric convolution is employed to extract motion features from different views. (b) Temporal fusion utilizes an attention module to propagate the memory feature into the current inference. (c) Segmentation decoder with parameter-free upsampling adopts multi-scale features to predict class labels. (d) Voting mechanism exploits memory predictions to optimize the motion state of each 3D voxel and instance.

III METHODOLOGY

III-A Framework Overview

LiDAR-based MOS aims to determine the motion state of each point in the current scan based on the multi-frame point clouds $\{\mathcal{P}_{t-n}\}_{n=0}^{N}$. To this end, existing methods first adopt the relative pose transformations $\{\mathcal{T}_{t-n\rightarrow t}\}_{n=1}^{N}$ provided by the LiDAR odometry to project the history scans $\{\mathcal{P}_{t-n}\}_{n=1}^{N}$ into the ego-car coordinate system of the current scan $\mathcal{P}_{t}$ and get $\{\mathcal{P}'_{t-n}\}_{n=1}^{N}$. Then, they usually feed $\mathcal{P}_{t}$ and $\{\mathcal{P}'_{t-n}\}_{n=1}^{N}$ into a network $\Psi$ to fuse spatio-temporal information and predict classification results $\mathcal{M}_{t}\in\mathbb{R}^{V\times 3}$ of all points in $\mathcal{P}_{t}$, where $V\times 3$ refers to the probability that $V$ points belong to three categories, including unknown, static and moving states.
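
The alignment step described above can be illustrated with a minimal NumPy sketch. It assumes that the odometry supplies 4×4 sensor-to-world poses, so that the relative transform is $\mathcal{T}_{t-n\rightarrow t}=T_t^{-1}T_{t-n}$; the function names `relative_pose` and `align_scan` are hypothetical and do not come from the released code.

```python
import numpy as np

def relative_pose(T_world_from_hist, T_world_from_cur):
    """Relative transform T_{t-n -> t}; assumes both inputs are 4x4 sensor-to-world poses."""
    return np.linalg.inv(T_world_from_cur) @ T_world_from_hist

def align_scan(points_hist, T_hist_to_cur):
    """Map a history scan of shape (V, 3) into the coordinate frame of the current scan."""
    ones = np.ones((points_hist.shape[0], 1))
    homogeneous = np.hstack([points_hist, ones])       # (V, 4)
    return (homogeneous @ T_hist_to_cur.T)[:, :3]      # row-vector form of T @ p

# Toy example: the sensor moved 2 m forward (+x) between the two scans.
T_hist = np.eye(4)
T_cur = np.eye(4)
T_cur[0, 3] = 2.0
scan_hist = np.array([[5.0, 1.0, 0.0]])                # a static point seen at t-n
aligned = align_scan(scan_hist, relative_pose(T_hist, T_cur))
```

In this toy case a static point that was 5 m ahead of the old pose ends up 3 m ahead of the new one, so static structure overlaps across aligned scans while moving objects leave a displacement.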

Different from previous approaches that focus on temporal fusion in a single inference, we additionally consider the association among multiple inferences and leverage the history feature $\mathcal{H}_{t-1}$ and predictions $\{\mathcal{M}_{t-m}\}_{m=1}^{M}$ to raise the quality of the current inference. Thus, our method formulates the MOS task as follows:

$$\mathcal{M}_{t}=\Psi\big(\mathcal{P}_{t},\{\mathcal{P}'_{t-n}\}_{n=1}^{N},\mathcal{H}_{t-1},\{\mathcal{M}_{t-m}\}_{m=1}^{M}\big) \tag{1}$$

where $N$ and $M$ are the numbers of historical LiDAR frames and forecasts, respectively. The details of our network are shown in Fig. 2. Specifically, given a series of scans, our StreamMOS first utilizes a multi-view encoder to capture the motion cues from the viewpoints of BEV and RV. Thereafter, we can get a motion feature $\mathcal{F}_{t}$ that reflects spatial information of moving objects in the current frame. Then, we use a temporal fusion module to combine $\mathcal{F}_{t}$ with the historical feature $\mathcal{H}_{t-1}$ retained in short-term memory. By doing this, some prior information can be transferred to the current inference and further used to decode movable objects $\mathcal{O}_{t}$ as well as the coarse motion state $\mathcal{C}_{t}$ for all points. Finally, we apply a voting mechanism to update $\mathcal{C}_{t}$ with the historical results $\{\mathcal{M}_{t-m}\}_{m=1}^{M}$ stored in long-term memory and the instance information derived from $\mathcal{O}_{t}$, thereby yielding the refined prediction $\mathcal{M}_{t}$.
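
To make the data flow of Eq. (1) concrete, the following is a minimal control-flow sketch of one streaming step. It is only an illustration under stated assumptions: the class `DualSpanMemory`, the function `stream_step` and the four stage callables (`encode`, `fuse`, `decode`, `vote`) are hypothetical names, and using $\mathcal{F}_t$ in place of the missing $\mathcal{H}_{t-1}$ on the first frame is an assumption, not something stated in the text.

```python
from collections import deque

class DualSpanMemory:
    """Short-term slot for one feature map, long-term queue for the last M predictions."""
    def __init__(self, num_predictions):
        self.feature = None                                # H_{t-1}
        self.predictions = deque(maxlen=num_predictions)   # {M_{t-m}}, m = 1..M

def stream_step(memory, scans, encode, fuse, decode, vote):
    """One inference of Eq. (1); the four callables are placeholder stages."""
    motion_feature = encode(scans)                         # F_t
    # Assumption: on the first frame the memory is empty, so F_t stands in for H_{t-1}.
    previous = motion_feature if memory.feature is None else memory.feature
    fused = fuse(motion_feature, previous)                 # H_t
    coarse, movable = decode(fused)                        # C_t, O_t
    refined = vote(coarse, movable, list(memory.predictions))  # M_t
    memory.feature = fused                                 # overwrite short-term memory
    memory.predictions.append(refined)                     # oldest entry drops out automatically
    return refined
```

The `deque` with `maxlen` keeps only the most recent $M$ predictions, which mirrors the sliding long-term memory, while the single `feature` slot is overwritten at every step.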

III-B Multi-Projection Feature Encoder

III-B1 Preliminaries

Unlike the existing methods that project point clouds into a single view, such as BEV [7] or RV [17], we believe that mapping points to both views could capture a more complete appearance and more obvious motion cues of dynamic objects. Meanwhile, as shown in the bottom of Fig. 2, the points can be considered as the intermediate carrier to transfer information between different perspectives. To achieve this, we use Point-to-BEV (P2B) and Point-to-Range (P2R) to project point features onto a 2D plane, while using BEV-to-Point (B2P) and Range-to-Point (R2P) to gather point features from multiple views. Specifically, assuming that the $k^{th}$ 3D point in $\mathcal{P}_{t}$ is denoted as $p_{k}^{3D}=(x_{k},y_{k},z_{k})$, P2B projects it onto a rectangular 2D grid and obtains its coordinate $(u_{k}^{b},v_{k}^{b})$ in BEV. For P2R, the point $p_{k}^{3D}$ with 3D Cartesian coordinates is converted into spherical coordinates $p_{k}^{sph}=(r_{k},\theta_{k},\phi_{k})$ and assigned to the 2D grid in RV with coordinate $(u_{k}^{r},v_{k}^{r})$ [18], where $r_{k}$, $\theta_{k}$, $\phi_{k}$ represent the distance, zenith angle and azimuth angle of point $p_{k}^{3D}$. The points falling into the same grid cell undergo max-pooling to aggregate features. For R2P and B2P, the grid features of RV and BEV are allocated to 3D points using bilinear interpolation within nearby grid cells.
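
The four projection operators can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: the grid ranges, field-of-view limits and function names are assumptions, and the range-view row is computed from the elevation angle (that is, $\pi/2$ minus the zenith angle), which is a common range-image convention.

```python
import numpy as np

def p2b_coords(points, x_range, y_range, width, height):
    """Point-to-BEV: continuous grid coordinates (u, v) on a rectangular grid."""
    u = (points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * width
    v = (points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * height
    return u, v

def p2r_coords(points, fov_up, fov_down, width, height):
    """Point-to-Range: continuous range-image coordinates from spherical angles (radians)."""
    r = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(points[:, 1], points[:, 0])
    elevation = np.arcsin(points[:, 2] / np.maximum(r, 1e-8))  # pi/2 - zenith
    u = 0.5 * (1.0 - azimuth / np.pi) * width
    v = (fov_up - elevation) / (fov_up - fov_down) * height
    return u, v

def scatter_max(features, u, v, width, height):
    """Max-pool the features (V, C) of all points that fall into the same cell."""
    col, row = np.floor(u).astype(int), np.floor(v).astype(int)
    valid = (col >= 0) & (col < width) & (row >= 0) & (row < height)
    grid = np.full((height, width, features.shape[1]), -np.inf)
    np.maximum.at(grid, (row[valid], col[valid]), features[valid])
    grid[np.isneginf(grid)] = 0.0          # empty cells carry a zero feature
    return grid

def gather_bilinear(grid, u, v):
    """B2P / R2P: interpolate grid features back to points; cell centers sit at index + 0.5."""
    height, width, _ = grid.shape
    uc, vc = u - 0.5, v - 0.5
    u0, v0 = np.floor(uc).astype(int), np.floor(vc).astype(int)
    wu, wv = (uc - u0)[:, None], (vc - v0)[:, None]
    u0c, u1c = np.clip(u0, 0, width - 1), np.clip(u0 + 1, 0, width - 1)
    v0c, v1c = np.clip(v0, 0, height - 1), np.clip(v0 + 1, 0, height - 1)
    top = (1 - wu) * grid[v0c, u0c] + wu * grid[v0c, u1c]
    bottom = (1 - wu) * grid[v1c, u0c] + wu * grid[v1c, u1c]
    return (1 - wv) * top + wv * bottom
```

Because both directions only need per-point grid coordinates, the same `(u, v)` pair can be reused to scatter point features into a view and later gather the processed features back, which is how the points act as the carrier between BEV and RV.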

III-B2 Network Structure

In the feature encoder, we first use a lightweight PointNet [19] as the point-wise encoder to process the point clouds ($\mathcal{P}_{t},\{\mathcal{P}'_{t-n}\}_{n=1}^{N}$) and obtain $\mathcal{E}_{n}\in\mathbb{R}^{V\times C}\ (n\in\{t-N,\dots,t\})$, where $C$ is the number of channels. Then, we adopt P2B to project the features of each frame into BEV and concatenate them along the channel dimension to get the BEV feature $\mathcal{G}_{t}^{0}\in\mathbb{R}^{W^{b}\times H^{b}\times(N+1)C}$, where $W^{b},H^{b}$ are the predefined width and height of BEV. Afterwards, we feed $\mathcal{G}_{t}^{0}$ into the multi-view encoder (MVE) to extract temporal information and capture object motion from different views.

Figure 3: Illustration of Asymmetric Conv Block and multi-view features.

In the lower part of Fig. 2, after downsampling the BEV feature $\mathcal{G}_{t}^{l}\ (l\in\{0,1\})$, we introduce an asymmetric convolution block (ACB) to perceive the movement of objects. As shown in Fig. 3(a), compared to the typical symmetric convolutional kernel (e.g. $3\times 3$), the kernel of ACB has one side longer than the other (e.g. $3\times 5$ and $5\times 3$). Besides, it decouples feature extraction into the horizontal and vertical directions, defined as follows:

$$f'=\text{Conv}_{3\times 3}\big(\text{Conv}_{h}(f)\odot\text{Conv}_{v}(f)\big)+f \tag{2}$$

where $f$ is the feature map and $\odot$ denotes the concatenation operation. $\text{Conv}_{h}$ and $\text{Conv}_{v}$ are asymmetric convolutions, which can expand the receptive field and improve the perception ability for dynamic objects, since these objects usually have obvious motion in a certain direction. After that, as displayed in Fig. 2, we apply B2P and P2R to project the BEV feature $\mathcal{G}^{b}$ into range view and use convolution layers to generate another motion feature $\mathcal{G}^{r}$, which is further remapped into BEV and fused with $\mathcal{G}^{b}$. Thus, complete motion cues can be encoded by cascade projection.
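
As a rough illustration of Eq. (2), the sketch below implements the decoupled block with a naive NumPy convolution. Several points are assumptions rather than details given in the text: $\text{Conv}_h$ is taken to use a $3\times 5$ (height × width) kernel and $\text{Conv}_v$ a $5\times 3$ kernel, both branches keep the channel count of $f$, and normalization and activation layers are omitted. The helper names are hypothetical.

```python
import numpy as np

def conv2d_same(x, weight):
    """Stride-1, zero-padded 2D cross-correlation.

    x: (C_in, H, W); weight: (C_out, C_in, kh, kw) with odd kh and kw.
    """
    c_out, _, kh, kw = weight.shape
    _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((c_out, height, width))
    for i in range(kh):
        for j in range(kw):
            window = padded[:, i:i + height, j:j + width]          # (C_in, H, W)
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j], window)
    return out

def asymmetric_conv_block(f, w_h, w_v, w_fuse):
    """Eq. (2): f' = Conv3x3(concat(Conv_h(f), Conv_v(f))) + f."""
    horizontal = conv2d_same(f, w_h)        # assumed 3x5 kernel, wider along the row axis
    vertical = conv2d_same(f, w_v)          # assumed 5x3 kernel, taller along the column axis
    stacked = np.concatenate([horizontal, vertical], axis=0)   # the concatenation in Eq. (2)
    return conv2d_same(stacked, w_fuse) + f                    # residual connection

rng = np.random.default_rng(0)
channels = 4
f = rng.normal(size=(channels, 6, 7))
w_h = rng.normal(size=(channels, channels, 3, 5))
w_v = rng.normal(size=(channels, channels, 5, 3))
w_fuse = rng.normal(size=(channels, 2 * channels, 3, 3))
f_out = asymmetric_conv_block(f, w_h, w_v, w_fuse)             # same shape as f
```

The two elongated kernels cover a $3\times 5$ and a $5\times 3$ neighborhood with 15 weights each per channel pair, so together they reach as far as a $5\times 5$ kernel along each axis while using 30 instead of 25 weights and treating the two motion directions separately.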

Thereafter, we stack two MVEs with a BEV-view encoder to get the discriminative motion feature $\mathcal{F}_{t}$. Specifically, we visualize multi-view features of different MVE layers in Fig. 3(b). The visualization suggests that MVE can extract consistent object information across various perspectives, while the deeper layer is capable of suppressing noise and preserving clearer motion features.

III-C Short-term Temporal Fusion

The purpose of this part is to transfer the memory feature $\mathcal{H}_{t-1}$ from the last inference to the present one, so that the historical spatial states of objects can be reused to guide the network to deduce object motion at time $t$. In this regard, we first build a short-term memory bank as a bridge to store $\mathcal{H}_{t-1}$ and connect adjacent inferences. Then, since $\mathcal{F}_{t}$ and $\mathcal{H}_{t-1}$ are not in the same coordinate system, we use learnable offsets [20] to adaptively find the relationship between the two features and combine them by attention weights. Specifically, $\mathcal{H}_{t-1}$ is first fed into two linear layers to produce $K$ attention weights $A_{k}$ and sampling offsets $\Delta g_{k}$. Afterward, based on the offsets $\Delta g_{k}$ and the coordinates $g_{k}$ of reference points in $\mathcal{F}_{t}$, bilinear interpolation is used to gather reference values $G_{k}$ from $\mathcal{F}_{t}$. Finally, $G_{k}$ is weighted by $A_{k}$ to obtain an updated feature $\hat{\mathcal{H}}_{t}$. The above process can be formulated as follows:

$$A_{k}=\text{Softmax}(\text{Linear}(\mathcal{H}_{t-1})),\quad \Delta g_{k}=\text{Linear}(\mathcal{H}_{t-1}) \tag{3}$$

$$G_{k}=S(\mathcal{F}_{t},\,g_{k}+\Delta g_{k}) \tag{4}$$

$$\hat{\mathcal{H}}_{t}=\sum_{l=1}^{L}W_{l}\Big(\sum_{k=1}^{K}A_{lk}\cdot G_{lk}\Big) \tag{5}$$

where $L$ and $K$ are the numbers of attention heads and reference points, respectively. $S(\cdot,\cdot)$ and $W_{l}$ represent the bilinear sampling and the learnable weight of multi-head attention. Later, $\hat{\mathcal{H}}_{t}$ is processed by a normalization layer and a feed-forward network (FFN) to generate a renewed $\mathcal{H}_{t}$ at the current time:

$$\tilde{\mathcal{H}}_{t}=\text{LN}(\hat{\mathcal{H}}_{t}+\mathcal{H}_{t-1}),\quad \mathcal{H}_{t}=\text{LN}(\text{FFN}(\tilde{\mathcal{H}}_{t})+\tilde{\mathcal{H}}_{t}) \tag{6}$$

Here, LN denotes layer normalization. $\mathcal{H}_{t}$ then replaces $\mathcal{H}_{t-1}$ in the short-term memory and participates in the prediction of the segmentation decoder.
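
Eqs. (3) to (6) can be read as a deformable-attention style update. The NumPy sketch below follows the equations literally under several assumptions that the text does not fix: each query uses its own grid position as the reference point $g_k$, sampled values $G_{lk}$ keep the full channel dimension $D$ so that each $W_l$ is a $D\times D$ matrix, layer normalization has no learnable affine parameters, and the FFN is a two-layer ReLU network. The parameter dictionary keys are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def bilinear_sample(feat, xy):
    """S(F_t, g): feat is (H, W, D); xy is (..., 2) in (x, y) pixel units, clamped to the map."""
    height, width, _ = feat.shape
    x = np.clip(xy[..., 0], 0, width - 1)
    y = np.clip(xy[..., 1], 0, height - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, width - 1), np.minimum(y0 + 1, height - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom

def temporal_fusion(F_t, H_prev, params, num_heads, num_points):
    """Return H_t of shape (H, W, D) from the current feature and the memory feature."""
    height, width, dim = F_t.shape
    q = H_prev.reshape(-1, dim)                                          # one query per cell
    attn = softmax((q @ params["W_attn"]).reshape(-1, num_heads, num_points))   # Eq. (3)
    offsets = (q @ params["W_off"]).reshape(-1, num_heads, num_points, 2)       # Eq. (3)
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    ref = np.stack([xs, ys], axis=-1).reshape(-1, 1, 1, 2)              # g_k (assumed)
    sampled = bilinear_sample(F_t, ref + offsets)                        # Eq. (4): (Q, L, K, D)
    per_head = (attn[..., None] * sampled).sum(axis=2)                   # inner sum of Eq. (5)
    H_hat = np.einsum("qld,lde->qe", per_head, params["W_head"])         # outer sum of Eq. (5)
    H_tilde = layer_norm(H_hat + q)                                      # Eq. (6), first part
    ffn = np.maximum(H_tilde @ params["W1"], 0.0) @ params["W2"]
    return layer_norm(ffn + H_tilde).reshape(height, width, dim)         # Eq. (6), second part
```

Expected parameter shapes in this sketch are `W_attn`: $(D, LK)$, `W_off`: $(D, 2LK)$, `W_head`: $(L, D, D)$, `W1`: $(D, D_{ffn})$ and `W2`: $(D_{ffn}, D)$. Since the offsets are predicted from $\mathcal{H}_{t-1}$, each memory location decides where to look in $\mathcal{F}_t$, which is what lets the two features be combined without an explicit coordinate alignment.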

III-D Reduced-Parameter Segmentation Decoder

To distinguish the static and dynamic points, the previous methods [7, 5, 8] usually leverage a UNet-like decoder to upsample multi-scale features progressively by convolutions. However, to reduce the complexity and storage costs of the network, we introduce a lightweight decoder, which first employs bilinear interpolation to resize the multi-scale features $\mathcal{G}_{t}^{i}\ (i\in\{1,2\})$ and $\mathcal{H}_{t}$ to a uniform height $H^{b}/2$ and width $W^{b}/2$. For each upsampled feature, we employ an auxiliary head to predict moving objects in BEV and exploit an auxiliary loss as a constraint, which helps ensure that multi-scale features are aligned and decoded well. Thereafter, we use B2P to convert the upsampled features into point features one by one and concatenate them to get $\hat{\mathcal{E}}_{t}\in\mathbb{R}^{V\times D}$. Finally, besides decoding the coarse motion states $\mathcal{C}_{t}\in\mathbb{R}^{V\times 3}$ of the LiDAR points $\mathcal{P}_{t}$, the point-wise decoder additionally predicts the probability $\mathcal{O}_{t}\in\mathbb{R}^{V\times 2}$ that points belong to movable objects (e.g. cars, bicycles) or static backgrounds, such as buildings, parking areas and roads:

$$\mathcal{C}_{t}=\text{MLP}_{1}(\mathcal{E}_{t}\odot\hat{\mathcal{E}}_{t}),\quad \mathcal{O}_{t}=\text{MLP}_{2}(\mathcal{E}_{t}\odot\hat{\mathcal{E}}_{t}) \tag{7}$$

where $\text{MLP}_{1}$ and $\text{MLP}_{2}$ are multi-layer perceptrons that do not share weights, and $\mathcal{E}_{t}$ is the feature from the point-wise encoder. From the discrete classification labels derived from $\mathcal{O}_{t}$, we can obtain instance attributes, such as location and size, by clustering and use them to optimize $\mathcal{C}_{t}$ in the subsequent voting stage.

III-E Long-term Voting Mechanism

Most existing approaches [4, 5, 17] focus on improving the quality of a single inference by modifying the network structure. Nevertheless, given the limited interpretability and data dependency of neural networks, this strategy may be limited. For example, for the parked car shown in Fig. 1(c), the model may predict it as stationary in one frame and dynamic in others. Meanwhile, lacking instance-level perception, the network may generate inconsistent results for different parts of the same object. To solve these issues, we present a voting mechanism consisting of voxel-based voting (VBV) and instance-based voting (IBV), which can be regarded as a post-processing step that corrects errors in the current predicted labels $\mathcal{C}_{t}$ using the historical results $\{\mathcal{M}_{t-m}\}_{m=1}^{M}$ and the movable labels $\mathcal{O}_{t}$. Note that $\mathcal{M}_{t-m}\ (m=1,\dots,M)$, expressed in the coordinate system of $\mathcal{P}_{t-m}$, is transformed in advance into the coordinate system of $\mathcal{P}_{t}$ by the pose transformation $\mathcal{T}_{t-m\rightarrow t}$, yielding $\mathcal{M}^{\prime}_{t-m}$.
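The alignment of a historical prediction to the current frame can be sketched as a rigid transform of its point coordinates, with the labels carried along unchanged. The sketch assumes the relative pose is available as a 4×4 homogeneous matrix; the function name and the example pose are hypothetical.

```python
def transform_points(points, pose):
    """Apply a 4x4 homogeneous transform (nested lists, row-major) to (x, y, z) points."""
    transformed = []
    for x, y, z in points:
        transformed.append(tuple(
            pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
            for r in range(3)
        ))
    return transformed


# Hypothetical relative pose from frame t-m to frame t: the sensor moved 2 m
# forward along x, so static points appear 2 m closer in the current frame.
T_prev_to_cur = [
    [1.0, 0.0, 0.0, -2.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

prev_points = [(10.0, 1.0, 0.0), (5.0, -3.0, 0.5)]
prev_labels = ["static", "moving"]  # labels are not changed by the transform

aligned_points = transform_points(prev_points, T_prev_to_cur)
```

After this step, `aligned_points` and `prev_labels` together play the role of one aligned historical prediction in the voting stage.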

III-E1 Voxel-based voting

Figure 4: The detail of our voting mechanism. It utilizes voxel-based voting (VBV) and instance-based voting (IBV) to refine coarse predictions.

Inspired by TFNet [21], we place the historical predictions $\{\mathcal{M}^{\prime}_{t-m}\}_{m=1}^{M}$ and the current forecast $\mathcal{C}_{t}$ in the same coordinate system. We then divide the points $\mathcal{P}_{t}$ into voxels of fixed size and fill $(\mathcal{C}_{t},\{\mathcal{M}^{\prime}_{t-m}\}_{m=1}^{M})$ into each voxel. As shown in Fig. 4(a), the most frequently predicted label is then taken as the motion state of all points in the same voxel, so that incorrect labels are updated. We summarize the VBV procedure as $\Omega(\mathcal{C}_{t},\{\mathcal{M}^{\prime}_{t-m}\}_{m=1}^{M})\mapsto\hat{\mathcal{C}}_{t}$.
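A minimal sketch of this per-voxel majority vote is given below. It assumes that the historical points of all $M$ frames are already aligned to the current frame and flattened into one list. The voxel size of 0.5 m and the tie-breaking rule are illustrative assumptions, not values taken from the method.

```python
from collections import Counter, defaultdict


def voxel_index(point, voxel_size):
    """Map an (x, y, z) point to the integer index of the voxel containing it."""
    return tuple(int(c // voxel_size) for c in point)


def voxel_based_voting(cur_points, cur_labels, hist_points, hist_labels, voxel_size=0.5):
    """Relabel each current point with the most frequent label in its voxel.

    Votes come from the current prediction and from the aligned historical
    predictions. Ties go to the label that was counted first (assumption).
    """
    votes = defaultdict(Counter)
    for point, label in zip(cur_points, cur_labels):
        votes[voxel_index(point, voxel_size)][label] += 1
    for point, label in zip(hist_points, hist_labels):
        votes[voxel_index(point, voxel_size)][label] += 1
    return [
        votes[voxel_index(point, voxel_size)].most_common(1)[0][0]
        for point in cur_points
    ]


cur_points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (3.0, 3.0, 0.0)]
cur_labels = ["moving", "static", "static"]
hist_points = [(0.15, 0.2, 0.2), (0.3, 0.1, 0.1), (3.1, 3.2, 0.1)]
hist_labels = ["moving", "moving", "static"]

refined = voxel_based_voting(cur_points, cur_labels, hist_points, hist_labels)
```

In this toy input, the second point shares a voxel with three "moving" votes, so its "static" label is overwritten.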

III-E2 Instance-based voting

Although VBV ensures the consistency of motion states within a local area and improves performance (Tab. IV), it struggles to unify the state at the instance level, as shown by the VBV output $\hat{\mathcal{C}}_{t}$ in Fig. 4(b). To solve this, we propose instance-based voting built on clustering. Specifically, given the predicted probability $\mathcal{O}_{t}$ from the decoder, we pick out the foreground points $\mathcal{P}^{\prime}_{t}$ from $\mathcal{P}_{t}$ and adopt DBSCAN [22] to split $\mathcal{P}^{\prime}_{t}$ into $S$ clusters. Then, from the coordinates of the points in each cluster, we compute $S$ minimum 3D bounding boxes that cover all objects. With these boxes, we crop out instance-level predictions from $\hat{\mathcal{C}}_{t}$ and the memory predictions $\{\mathcal{M}^{\prime}_{t-m}\}_{m=1}^{M}$. Finally, similar to voxel-based voting, we adopt the most frequent class label as the motion state of all points in the instance and obtain the final prediction as $\Phi(\hat{\mathcal{C}}_{t},\{\mathcal{M}^{\prime}_{t-m}\}_{m=1}^{M},\mathcal{O}_{t})\mapsto\mathcal{M}_{t}$.
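The instance-level vote can be sketched as follows. The sketch makes several assumptions that are not stated in the text: cluster ids are supplied by an external clustering step (with -1 marking points outside any cluster), the bounding boxes are axis-aligned, and all function names are hypothetical.

```python
from collections import Counter


def bounding_box(points):
    """Axis-aligned box (assumption) enclosing the points: (min corner, max corner)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def inside(point, box):
    """True if the point lies within the box, bounds included."""
    lower, upper = box
    return all(lo <= c <= hi for c, lo, hi in zip(point, lower, upper))


def instance_based_voting(cur_points, cur_labels, cluster_ids, hist_points, hist_labels):
    """Give every point of a cluster the most frequent label found in its box.

    cluster_ids[i] is the cluster of cur_points[i]; -1 means "no instance".
    """
    refined = list(cur_labels)
    for cid in sorted(set(cluster_ids) - {-1}):
        members = [i for i, c in enumerate(cluster_ids) if c == cid]
        box = bounding_box([cur_points[i] for i in members])
        # Votes from the current (voxel-refined) labels of the instance ...
        votes = Counter(cur_labels[i] for i in members)
        # ... plus votes from aligned historical points cropped by the box.
        votes.update(l for p, l in zip(hist_points, hist_labels) if inside(p, box))
        winner = votes.most_common(1)[0][0]
        for i in members:
            refined[i] = winner
    return refined


cur_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.5), (10.0, 10.0, 0.0)]
cur_labels = ["moving", "static", "moving", "static"]
cluster_ids = [0, 0, 0, -1]
hist_points = [(0.5, 0.5, 0.2), (0.8, 0.2, 0.1), (9.0, 9.0, 0.0)]
hist_labels = ["moving", "moving", "static"]

final = instance_based_voting(cur_points, cur_labels, cluster_ids, hist_points, hist_labels)
```

Here the single "static" point inside the cluster is outvoted, so the whole instance receives one motion state while the background point is left untouched.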

III-E3 Long-term memory updating

We construct a memory bank of length $M$ to retain historical results. When a new refined prediction $\mathcal{M}_{t}$ is output by the voting mechanism, we append it to the memory bank and discard the oldest result.
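Such a fixed-length memory behaves like a first-in, first-out queue. The class below is a hypothetical sketch built on Python's `collections.deque`, where string placeholders stand in for the stored predictions.

```python
from collections import deque


class LongTermMemory:
    """Keep the most recent `length` refined predictions."""

    def __init__(self, length):
        # A deque with maxlen drops its oldest item when a new one is appended.
        self.bank = deque(maxlen=length)

    def push(self, prediction):
        """Store a new refined prediction, evicting the oldest if the bank is full."""
        self.bank.append(prediction)

    def history(self):
        """Return the stored predictions from oldest to newest."""
        return list(self.bank)


memory = LongTermMemory(length=3)
for t in range(5):
    memory.push(f"M_{t}")  # placeholder for the refined prediction of frame t
```

After five pushes into a bank of length 3, only the predictions of the last three frames remain.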

As a result, compared with relying only on the adaptive learning of the network, our voting mechanism explicitly suppresses incorrect predictions and improves segmentation consistency by analyzing long-term predictions at the voxel and instance levels.

III-F Loss Functions

To ensure that the network is fully optimized, we separate the training process into two stages. In the 1st stage, we train the network without predicting movable objects in the decoder. Following previous works [7, 5], we use the weighted cross-entropy loss ($L_{wce}$) and the Lovász-Softmax loss ($L_{ls}$) [23] to supervise the network:

$$L=\lambda_{1}L_{wce}+\lambda_{2}L_{ls},\quad L_{s1}=L(y,\hat{y})+\lambda_{3}\sum_{i=1}^{3}L(y_{i}^{b},\hat{y}_{i}^{b}) \tag{8}$$

where $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are the loss weights, while $y$ and $\hat{y}$ are the ground truth and the predicted results of the points. $L(y_{i}^{b},\hat{y}_{i}^{b})$ denotes the auxiliary loss of the $i$-th BEV prediction. In the 2nd stage, we freeze the parameters optimized in the 1st stage and train only the rest of the network to estimate movable objects with the following loss function:

$$L_{s2}=\lambda_{1}L_{wce}(x,\hat{x})+\lambda_{2}L_{ls}(x,\hat{x}) \tag{9}$$

where $x$ and $\hat{x}$ represent the ground-truth labels and the predictions for movable objects.
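To make the weighted cross-entropy term concrete, the sketch below computes it for a handful of points and combines it with a placeholder value for the Lovász-Softmax term. The class weights, the normalization by the sum of weights, the λ values and the placeholder are all assumptions for illustration. The Lovász-Softmax loss itself is not implemented here.

```python
import math


def weighted_cross_entropy(probs, targets, class_weights):
    """Weighted mean of -log p(target) over points.

    probs[i][c]   : predicted probability of class c for point i
    targets[i]    : ground-truth class index of point i
    class_weights : per-class weights (hypothetical values in the example)
    Assumption: the sum is normalized by the total weight of the targets.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for point_probs, target in zip(probs, targets):
        weight = class_weights[target]
        weighted_sum += -weight * math.log(point_probs[target])
        weight_total += weight
    return weighted_sum / weight_total


def combined_loss(l_wce, l_ls, lambda1=1.0, lambda2=1.0):
    """L = lambda1 * L_wce + lambda2 * L_ls, with hypothetical default weights."""
    return lambda1 * l_wce + lambda2 * l_ls


probs = [[0.9, 0.1], [0.2, 0.8]]
targets = [0, 1]
class_weights = [1.0, 3.0]  # the rarer class is up-weighted (illustrative)

l_wce = weighted_cross_entropy(probs, targets, class_weights)
l_total = combined_loss(l_wce, l_ls=0.25)  # 0.25 is a placeholder Lovász value
```

Up-weighting the rarer class makes errors on it dominate the average, which is the usual motivation for this loss when moving points are scarce.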

Figure 5: Visualization of MOS results on the SemanticKITTI validation set. Incorrect predictions are marked by blue circles. Best viewed in color and zoomed in.

IV EXPERIMENTS

IV-A Experimental Settings

Datasets. We compare segmentation performance with previous methods and conduct extensive ablation studies on the SemanticKITTI-MOS [4] and Sipailou-Campus [7] datasets. The SemanticKITTI-MOS dataset is collected with a Velodyne HDL-64E LiDAR and contains a total of 22 sequences with labeled point clouds, whose labels are remapped from 28 semantic classes into 3 types of motion states. Following previous algorithms [7, 5, 8], we use sequences 00-07 and 09-10 (19,130 frames) for training, sequence 08 (4,071 frames) for validation and sequences 11-21 (20,351 frames) for testing. For Sipailou-Campus, which is collected with a solid-state LiDAR, we follow the implementation of [7] and split the dataset into 5 training sequences (16,887 frames), 1 validation sequence (3,191 frames) and 2 test sequences (6,201 frames).

Evaluation Metric. Consistent with present approaches [8, 5], we employ the Jaccard Index or Intersection-over-Union (IoU) metric [24] over dynamic objects to measure the MOS performance, which can be denoted as:

$$\text{IoU}=\frac{\text{TP}}{\text{TP}+\text{FP}+\text{FN}} \tag{10}$$

where TP, FP, and FN denote the numbers of true positive, false positive, and false negative predictions for the dynamic category.
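As a small worked example of Eq. (10), the function below counts TP, FP and FN for the dynamic class over per-point labels. The label encoding (1 for moving, 0 otherwise) is an assumption made for this sketch.

```python
def moving_iou(pred, gt, moving_label=1):
    """IoU of the dynamic class computed from per-point predicted and true labels."""
    tp = sum(1 for p, g in zip(pred, gt) if p == moving_label and g == moving_label)
    fp = sum(1 for p, g in zip(pred, gt) if p == moving_label and g != moving_label)
    fn = sum(1 for p, g in zip(pred, gt) if p != moving_label and g == moving_label)
    denominator = tp + fp + fn
    # The metric is undefined when no point is moving in either input.
    return tp / denominator if denominator else float("nan")


pred = [1, 1, 0, 0, 1]
gt = [1, 0, 1, 0, 1]
iou = moving_iou(pred, gt)  # TP = 2, FP = 1, FN = 1
```

Points that are correctly labeled as static do not enter the metric, so the score reflects only how well moving points are recovered.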

IV-B Implementation Details

In data processing, we adopt widely used data augmentations, such as random rotation, flipping and slight translation, to enrich the training data, which plays an important role in improving model generalization. As mentioned in Sec. III-F, we optimize the network with a two-stage training strategy. In the 1st stage, we train the model for 48 epochs on NVIDIA RTX 4090 GPUs using an SGD optimizer with an initial learning rate of 0.02, which is decayed by a factor of 0.1 every 10 epochs. In the 2nd stage, we optimize the network for only 10 epochs with a learning rate of 0.02. Furthermore, each LiDAR scan is limited to [-50 m, 50 m] on the X and Y axes and [-4 m, 2 m] on the Z axis. The number of points in each scan is randomly downsampled or padded to $V = 1.3\times 10^{5}$.
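The range cropping and fixed-size resampling described above can be sketched as follows. The inclusive bounds and the padding strategy (repeating randomly chosen existing points) are assumptions, since the text does not specify how scans with too few points are filled.

```python
import random

V_TARGET = 130_000  # V = 1.3 x 10^5 points per scan


def crop_to_range(points, xy_limit=50.0, z_min=-4.0, z_max=2.0):
    """Keep points with |x|, |y| <= xy_limit and z_min <= z <= z_max (inclusive, assumption)."""
    return [
        p for p in points
        if abs(p[0]) <= xy_limit and abs(p[1]) <= xy_limit and z_min <= p[2] <= z_max
    ]


def resample_to_fixed_size(points, target, rng):
    """Randomly downsample or pad a scan so that it holds exactly `target` points."""
    if not points:
        raise ValueError("cannot resample an empty scan")
    if len(points) >= target:
        return rng.sample(points, target)
    # Assumption: pad by repeating randomly chosen existing points.
    padding = [rng.choice(points) for _ in range(target - len(points))]
    return points + padding


rng = random.Random(0)
scan = [
    (0.0, 0.0, 0.0),
    (60.0, 0.0, 0.0),     # outside the X range
    (10.0, -20.0, 1.0),
    (5.0, 5.0, 3.0),      # above the Z range
    (-49.9, 49.9, -3.9),
]
cropped = crop_to_range(scan)
fixed = resample_to_fixed_size(cropped, 5, rng)  # a real scan would use V_TARGET
```

A fixed point count keeps tensor shapes constant across scans, which simplifies batching during training.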

IV-C Quantitative Results

TABLE I: Performance comparison on SemanticKITTI validation and test sets. * denotes the method exploiting semantic labels. † means the method is trained both on SemanticKITTI and KITTI-road datasets.
Methods Source IoU (Val) IoU (Test)
KPConv ICCV 19 - 60.9
SpSequenceNet CVPR 20 - 43.2
LiMoSeg arXiv 21 52.6 -
LMNet RA-L 21 66.4 58.3
Cylinder3D CVPR 21 66.3 61.2
AutoMOS RA-L 22 - 54.3
MotionSeg3D, v1 IROS 22 68.1 62.5
MotionSeg3D, v2 IROS 22 71.4 64.9
4DMOS RA-L 22 77.2 65.2
MotionBEV, w/o delay RA-L 23 68.1 63.9
MotionBEV, w/ delay RA-L 23 76.5 69.7
StreamMOS-V - 78.3 73.1
LMNet* RA-L 21 67.1 62.5
RVMOS* RA-L 22 71.2 74.7
InsMOS* IROS 23 73.2 70.6
InsMOS*† IROS 23 - 75.6
MF-MOS* ICRA 24 76.1 76.7
StreamMOS-VI* - 81.6 77.8

Comparison with previous methods. We first evaluate our StreamMOS on the SemanticKITTI-MOS benchmark. To ensure fairness, our method is split into two versions in Tab. I, which keeps the settings as consistent as possible with previous works. Specifically, (a) StreamMOS-V denotes the network that is trained only in the 1st stage and uses only voxel-based voting as post-processing. (b) StreamMOS-VI* additionally performs the 2nd-stage training and uses instance-based voting, which relies on movable object predictions. Our methods are then compared with existing algorithms, which are grouped by whether semantic annotations are utilized. In particular, v1 and v2 of MotionSeg3D [5] refer to using kNN or the point refine module as post-processing, respectively. Moreover, “w/ delay” signifies exploiting the point cloud frames within the time window $[t,t+N]$ to estimate dynamic objects in frame $t$. Note that, following MotionBEV [7], our results in Tab. I are obtained by training on the original SemanticKITTI without any additional data.

TABLE II: Performance comparison on Sipailou-Campus dataset.
Methods Source IoU (Val) IoU (Test)
LMNet RA-L 21 54.3 56.2
MotionSeg3D, v2 IROS 22 65.6 66.8
4DMOS RA-L 22 87.3 88.9
MotionBEV RA-L 23 89.2 90.8
StreamMOS-V - 90.9 92.5
TABLE III: Comparison of running time (ms) with previous methods.
Methods Running time (ms)
4DMOS 86
MF-MOS* 96
MotionSeg3D, v1 42
MotionSeg3D, v2 117
InsMOS* 120
RVMOS* 29
StreamMOS-V 62
StreamMOS-VI* 96

As illustrated in Tab. I, our streaming method outperforms previous works in most cases. Specifically, our StreamMOS-V exceeds 4DMOS [9] by 1.1% and 7.9% IoU on the validation and test sets. We think that, compared with using a binary Bayes filter to merge historical results as in 4DMOS, our method additionally considers the historical feature from the last inference, which serves as a strong spatial prior to improve prediction quality. At the same time, our StreamMOS-VI* significantly surpasses InsMOS* [8] and MF-MOS* [25] on the validation set (↑8.4% and ↑5.5%) by instance-based voting. Finally, since the Sipailou-Campus dataset lacks semantic annotations, we list only [4, 7, 5, 9] in Tab. II, which confirms the effectiveness of StreamMOS-V even with a solid-state LiDAR that has a narrow field of view and a non-repetitive scanning pattern.

Inference Speed. Although our method uses an attention mechanism to build feature associations between inferences and merges multiple historical predictions in the voting mechanism, it still keeps a competitive running time compared with previous approaches in Tab. III. We attribute this to the projection-based backbone, lightweight deformable attention and parameter-free upsampling in the decoder, which allow our method to strike a balance between speed and performance.

TABLE IV: The effect of different modules in SemanticKITTI validation.
TF MVE VBV IBV IoU [%] Δ
A1 67.1 -
A2 73.2 +6.1
A3 77.1 +10.0
A4 78.3 +11.2
A5 81.3 +14.2
A6 81.6 +14.5
TABLE V: Ablation experiment on multi-view encoder of StreamMOS-V.
RV BEV ACB Parallel Series IoU [%]
B1 70.3
B2 74.2
B3 77.5
B4 74.8
B5 78.3
TABLE VI: Ablation experiment on temporal fusion of StreamMOS-V.
Strategy IoU [%] Δ
C1 w/o Temporal Fusion 72.1 -6.2
C2 Cross-attention 73.0 -5.3
C3 Concatenation 74.8 -3.5
C4 Addition 75.6 -2.7
C5 Deform-attention 78.3 -

IV-D Qualitative Analysis

To compare with previous methods intuitively, we visualize segmentation results in various scenarios. As demonstrated in Fig. 5, LMNet* suffers from blurred boundaries in the 4th row. Although MotionSeg3D adopts a point refinement module to alleviate this problem, it still makes mistakes when dealing with distant objects in the 2nd and 5th rows. Moreover, lacking instance-level sensing, MotionSeg3D is prone to incomplete predictions, such as in the 1st row. Although adding instance detection, as in InsMOS*, can improve the integrity of segmentation, it aggravates the negative impact when the prediction is incorrect, as illustrated in the 2nd, 3rd and 5th rows. Unlike the above algorithms, our StreamMOS-VI* first combines observations from multiple views to improve the perception of objects at different distances. Then, instead of focusing on a single inference, we build relationships across several inferences to integrate historical features and predictions, which are used to suppress false results and improve segmentation integrity. Thus, our method achieves superior performance in Fig. 5.

IV-E Ablation Study

In this section, we conduct comprehensive experiments on the SemanticKITTI validation set to confirm the effectiveness of the proposed modules.

Model Components. As shown in Tab. IV, our StreamMOS mainly includes four crucial modules: temporal fusion (TF), multi-view encoder (MVE), voxel-based voting (VBV), and instance-based voting (IBV). To understand their importance to overall performance, we first remove all of these modules from StreamMOS and regard the rest as the baseline in A1. After building feature correlations between inferences by TF, the IoU increases by 6.1% in A2. Moreover, benefiting from capturing multi-view motion cues from BEV and RV, MVE brings a further improvement in A3. Then, by introducing object-level perception, instance-based voting in A5 performs better than voxel-based voting in A4, which only focuses on limited areas within a 3D cube. Finally, we achieve the best performance in A6 by combining them into a refinement procedure from voxel to instance, indicating that effectively utilizing long-term predictions is a key element of the LiDAR MOS task.

Multi-view Encoder. We compare several multi-view encoding strategies in Tab. V. From B1 and B2, we observe that when object motion is encoded on a single view only, the BEV representation achieves better results than RV owing to its global perspective and motion consistency. Then, we divide the encoder into BEV and RV branches and extract multi-view features in series (B3), which further increases the IoU and exceeds the parallel mode (B4) by 2.7%. We think the series manner may be more suitable for deriving consistent motion features from different views owing to its progressive encoding. Furthermore, using the asymmetric convolution block (ACB) yields a 0.8% improvement in B5, showing the advantage of decoupling horizontal and vertical encoding.

Figure 6: Ablation study on the time window length of voting mechanism.
Figure 7: The effect of frame number and BEV size in our StreamMOS-V.

Temporal Fusion. The strategy for propagating the historical feature into the current inference affects segmentation quality, as demonstrated in Tab. VI. First, we observe that removing temporal fusion, and with it the prior information, leads to a clear drop (↓6.2%). Then, compared with directly adopting concatenation or addition to merge features in different coordinate systems, deformable attention aligns features adaptively through learnable offsets and gains 3.5% and 2.7% IoU, respectively. Moreover, it is worth noting that cross-attention gets the worst result among the fusion strategies, since redundant global attention may have a negative effect. In contrast, deformable attention concentrates on local features, which avoids model overfitting and saves computation.

Time Window Length. The time window length determines how far back predictions can be used by the voting mechanism. Thus, we conduct experiments on the time window length to choose the optimal setting for our algorithm. As displayed in Fig. 6, the performance increases rapidly until the length $M$ of the time window reaches 8. Although further increasing the length yields a slight improvement, it costs more time. Thus, we opt for $M = 8$ as our default.

Other Hyper-parameter Settings. In Fig. 7, we explore the impact of frame number and BEV resolution on performance. We observe that the optimal BEV size $(W^{b},H^{b})$ is $512\times 512$. A BEV resolution that is too small prevents the network from capturing the motion of small objects, while an excessively large resolution makes it sensitive to slight disturbances. Besides, a larger BEV image contains more empty grids, which may dilute useful information.

Furthermore, as shown in Fig. 7(b), compared with previous approaches [8, 9] that require many frames to extract spatio-temporal features, our method relies on only 3 frames to achieve the best result. We think this is due to the effective reuse of historical features and predictions in temporal fusion and voting, which brings rich prior knowledge to the network. Meanwhile, feeding too many frames into the network causes information redundancy and degrades performance.

V Conclusion

In this paper, we analyse the limitations of existing MOS methods and propose a novel streaming structure, which uses a memory bank as a bridge to transfer prior information among inferences. Moreover, our StreamMOS captures the complete appearance and motion features of objects from multiple views. To correct false predictions in a single inference, we propose a voting mechanism that integrates historical predictions at the voxel and instance levels. Extensive experimental results show the effectiveness of the proposed modules and indicate that our method achieves competitive performance in diverse aspects.

References

  • [1] J. Zhang and S. Singh, “Loam: Lidar odometry and mapping in real-time,” in Robotics: Science and systems, vol. 2, pp. 1–9, Berkeley, CA, 2014.
  • [2] B. Guo, N. Guo, and Z. Cen, “Obstacle avoidance with dynamic avoidance risk region for mobile robots in dynamic environments,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 5850–5857, 2022.
  • [3] P. Chen, J. Pei, W. Lu, and M. Li, “A deep reinforcement learning based method for real-time path planning and dynamic obstacle avoidance,” Neurocomputing, vol. 497, pp. 64–75, 2022.
  • [4] X. Chen, S. Li, B. Mersch, L. Wiesmann, J. Gall, J. Behley, and C. Stachniss, “Moving object segmentation in 3d lidar data: A learning-based approach exploiting sequential data,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 6529–6536, 2021.
  • [5] J. Sun, Y. Dai, X. Zhang, J. Xu, R. Ai, W. Gu, and X. Chen, “Efficient spatial-temporal information fusion for lidar-based 3d moving object segmentation,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 11456–11463, IEEE, 2022.
  • [6] S. Mohapatra, M. Hodaei, S. Yogamani, S. Milz, H. Gotzig, M. Simon, H. Rashed, and P. Maeder, “Limoseg: Real-time bird’s eye view based lidar motion segmentation,” arXiv preprint arXiv:2111.04875, 2021.
  • [7] B. Zhou, J. Xie, Y. Pan, J. Wu, and C. Lu, “Motionbev: Attention-aware online lidar moving object segmentation with bird’s eye view based appearance and motion features,” arXiv preprint arXiv:2305.07336, 2023.
  • [8] N. Wang, C. Shi, R. Guo, H. Lu, Z. Zheng, and X. Chen, “Insmos: Instance-aware moving object segmentation in lidar data,” arXiv preprint arXiv:2303.03909, 2023.
  • [9] B. Mersch, X. Chen, I. Vizzo, L. Nunes, J. Behley, and C. Stachniss, “Receding moving object segmentation in 3d lidar data using sparse 4d convolutions,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7503–7510, 2022.
  • [10] J. Schauer and A. Nüchter, “The peopleremover—removing dynamic objects from 3-d point cloud data by traversing a voxel occupancy grid,” IEEE robotics and automation letters, vol. 3, no. 3, pp. 1679–1686, 2018.
  • [11] S. Pagad, D. Agarwal, S. Narayanan, K. Rangan, H. Kim, and G. Yalla, “Robust method for removing dynamic objects from point clouds,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 10765–10771, IEEE, 2020.
  • [12] F. Pomerleau, P. Krüsi, F. Colas, P. Furgale, and R. Siegwart, “Long-term 3d map maintenance in dynamic environments,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 3712–3719, 2014.
  • [13] G. Kim and A. Kim, “Remove, then revert: Static point cloud map construction using multiresolution range images,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10758–10765, IEEE, 2020.
  • [14] R. Ambruş, N. Bore, J. Folkesson, and P. Jensfelt, “Meta-rooms: Building and maintaining long term spatial models in a dynamic world,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1854–1861, 2014.
  • [15] H. Lim, S. Hwang, and H. Myung, “Erasor: Egocentric ratio of pseudo occupancy-based dynamic object removal for static 3d point cloud map building,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 2272–2279, 2021.
  • [16] T. Kreutz, M. Mühlhäuser, and A. S. Guinea, “Unsupervised 4d lidar moving object segmentation in stationary settings with multivariate occupancy time series,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1644–1653, 2023.
  • [17] J. Kim, J. Woo, and S. Im, “Rvmos: Range-view moving object segmentation leveraged by semantic and motion features,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 8044–8051, 2022.
  • [18] X. Li, G. Zhang, H. Pan, and Z. Wang, “Cpgnet: Cascade point-grid fusion network for real-time lidar semantic segmentation,” in 2022 International Conference on Robotics and Automation (ICRA), pp. 11117–11123, IEEE, 2022.
  • [19] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [20] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” 2021.
  • [21] R. Li, S. Li, X. Chen, T. Ma, W. Hao, J. Gall, and J. Liang, “Tfnet: Exploiting temporal cues for fast and accurate lidar semantic segmentation,” arXiv preprint arXiv:2309.07849, 2023.
  • [22] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in kdd, vol. 96, pp. 226–231, 1996.
  • [23] M. Berman, A. R. Triki, and M. B. Blaschko, “The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4413–4421, 2018.
  • [24] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, pp. 303–338, 2010.
  • [25] J. Cheng, K. Zeng, Z. Huang, X. Tang, J. Wu, C. Zhang, X. Chen, and R. Fan, “Mf-mos: A motion-focused model for moving object segmentation,” arXiv preprint arXiv:2401.17023, 2024.