StreamMOS: Streaming Moving Object Segmentation with Multi-View Perception and Dual-Span Memory

Zhiheng Li, Yubo Cui, Jiexi Zhong, Zheng Fang

This work was supported by the National Natural Science Foundation of China under Grant 62073066, the Fundamental Research Funds for Central Universities under Grant N2226001, and the 111 Project under Grant B16009. (Corresponding author: Zheng Fang, e-mail: [email protected]) The authors are all with the Faculty of Robot Science and Engineering, Northeastern University, Shenyang, China.

Abstract

Moving object segmentation based on LiDAR is a crucial and challenging task for autonomous driving and mobile robotics. Most approaches explore spatio-temporal information from LiDAR sequences to predict moving objects in the current frame. However, they often focus on transferring temporal cues within a single inference and regard every prediction as independent of the others. This may cause inconsistent segmentation results for the same object in different frames. To overcome this issue, we propose a streaming network with a memory mechanism, called StreamMOS, to build associations between features and predictions across multiple inferences. Specifically, we utilize a short-term memory to convey historical features, which can be regarded as a spatial prior of moving objects and are adopted to enhance the current inference through temporal fusion. Meanwhile, we build a long-term memory to store previous predictions and exploit them to refine the present forecast at the voxel and instance levels through voting. Besides, we present a multi-view encoder with cascade projection and asymmetric convolution to extract motion features of objects in different representations. Extensive experiments validate that our algorithm achieves competitive performance on the SemanticKITTI and Sipailou Campus datasets. Code will be released at https://github.com/NEU-REAL/StreamMOS.git.

I INTRODUCTION

On urban roads, there are often many dynamic objects with variable trajectories, such as vehicles and pedestrians, which create a collision risk for autonomous vehicles. Meanwhile, moving objects cause errors in simultaneous localization and mapping (SLAM) [1] and bring challenges for obstacle avoidance [2] and path planning [3]. As a result, online moving object segmentation (MOS) based on LiDAR points has become a crucial task in multiple fields. However, owing to the unordered and sparse nature of LiDAR points, MOS still faces some challenging cases, especially the difficulty of perceiving distant moving objects that are covered by only a few points.

To tackle the above problem, the mainstream strategy is to exploit spatio-temporal information from LiDAR sequences. For instance, Chen et al. [4] generate residual image in range view (RV), which reflects spatial position of dynamic objects in each frame and can be utilized to perform temporal fusion to predict moving objects. Following the RV-based projection in [4], Sun et al. [5] adopt motion-guided attention to better explore temporal motion cues from residual images. Besides, some works [6, 7] attempt to map point clouds on bird’s eye view (BEV) and ensure consistent object size and movement. Recently, Wang et al. [8] process LiDAR sequences directly via 4D convolution to construct temporal associations while adding instance detection to promote segmentation integrity.

Figure 1: Pipeline comparison of moving object segmentation approaches. We compare the structure of proposed StreamMOS with previous methods in (a) and (b). Meanwhile, the segmentation results obtained by our method achieve better spatial integrity and temporal continuity in (c).

However, as displayed in Fig. 1(a), these methods focus on temporal fusion in a single inference and make independent predictions for each frame, leading to inconsistent results for the same object at different moments (in Fig. 1(c)). Despite Mersch et al. [9] leveraging a binary Bayes filter to combine multiple predictions, it still ignores information transmission at feature level, which supplies rich spatial context to the next inference. Thus, we present a “streaming” structure as shown in Fig. 1(b), which regards historical feature as a strong prior and exploits it to guide the current inference. Meanwhile, the past predictions are stored in long-term memory and utilized to suppress false predictions. In this way, we construct robust correlations in multiple inferences and fully explore temporal information to ensure consistent results in different frames.

To implement the idea of streaming, we propose a moving object segmenter, called StreamMOS, which encodes object motion cues from multiple views and adopts a dual-span memory to transfer historical information. Specifically, different from previous works that map point clouds onto one view, we argue that various viewpoints provide more holistic observations of dynamic objects. Thus, we propose a multi-view encoder that applies a cascade structure to iteratively obtain dense appearance from RV and perceive intuitive motion on BEV, resulting in more distinguishable features of dynamic objects (Fig. 3(b)). Meanwhile, during BEV encoding, we introduce asymmetric convolution with a decoupled strategy to better capture vertical and horizontal motion. Then, we use an attention mechanism to implement temporal fusion, which aligns features from different times and conveys a spatial prior to the current inference. Besides, due to the inherent uncertainty of neural networks, the output of the segmentation decoder may be inconsistent across frames (Fig. 1(c)). To solve this issue, we propose a voting mechanism as post-processing to optimize the predicted labels. Its core idea is to statistically analyze long-term motion states at the voxel and instance levels, and then select the most likely state to update the raw point-wise forecasts. In this way, previous results can be used to refine current predictions, enhancing both the temporal continuity and the spatial completeness of segmentation.

In summary, the contributions of our work are as follows:

• We present a novel streaming framework called StreamMOS, which exploits short-term and long-term memory to construct associations among inferences and improve the integrity and continuity of predictions in MOS task.
• We propose a multiple projection architecture to capture the object motion and complete appearance from multi-view. We also present a multi-level voting mechanism to refine segmentation results for every voxel and instance.
• The extensive experiments confirm that our StreamMOS outperforms previous algorithms on the SemanticKITTI (77.8%) and Sipailou Campus (92.5%) while running in real-time. Our code will be available to the community.

II RELATED WORK

II-A Geometric-based Algorithms

The initial LiDAR-based MOS methods can be referred to as geometric-based approaches, which typically build a map in advance and remove dynamic objects by estimating occupancy probability and determining visibility. For example, Schauer et al. [10] proposed a ray casting-based approach that counted the hits and misses of scans to update the occupancy of the grid map. Afterwards, Pagad et al. [11] constructed an occupancy octree map and proposed a probability update mechanism to obtain clean point clouds by considering the occupancy history. Despite their promising results, [10, 11] suffer from a heavy computational burden because they cast rays and update voxels one by one. To improve efficiency, several visibility-based algorithms [12, 13, 14] have been developed. Pomerleau et al. [12] identified moving objects by checking whether the points of the pre-built map are occluded by the points in the query frame. Meanwhile, to avoid the mislabeling of ground points as dynamic reported in [12], Kim et al. [13] recovered ground points from the removed points using a multi-resolution reverting algorithm. Moreover, Lim et al. [15] introduced a visibility-free approach that removed moving traces by computing the pseudo occupancy ratio between the query scan and the submap in each grid. Although the above methods can distinguish the motion state of objects and clean maps well, they are often performed offline because they require a prior map, and they may not be suitable for real-time applications.

II-B Learning-based Algorithms

Recently, many studies have focused on utilizing learning-based approaches to eliminate dynamic objects online, which only take consecutive frame point clouds as input rather than a pre-built map. Meanwhile, according to data representation, these approaches could be grouped into projection-based and point-based methods. The former converts point clouds into bird’s eye view (BEV) or range view (RV) images, while the latter processes 3D raw points directly.

Specifically, for point-based algorithms, Mersch et al. [9] adopted sparse 4D convolutions to process a series of LiDAR scans and predicted moving objects in each frame. They also employed a binary Bayes filter to fuse multiple predictions in a sliding window. Subsequently, Kreutz et al. [16] proposed an unsupervised approach to address MOS task in stationary LiDAR and viewed it as a multivariate time series clustering problem. Lately, Wang et al. [8] introduced InsMOS to unify detection and segmentation of moving objects into a network, so that the instance cues can be used to improve segmentation integrity. Although they achieved promising performance, the feature extraction of numerous points in [8] may cause high computational costs.

Compared to the mentioned approaches, projection-based algorithms [4, 5, 17, 6, 7] are generally more efficient owing to handling ordered and dense data. For instance, Chen et al. [4] mapped LiDAR scans into spherical coordinates and generated residual images to extract dynamic information in sequence. Sun et al. [5] designed a dual-branch to explore the spatial-temporal information and relieved boundary blurring problem by a point refinement module. Furthermore, Kim et al. [17] achieved higher performance by using extra semantic features. In contrast to range projection, Mohapatra et al. [6] and Zhou et al. [7] utilized BEV projection to obtain a more intuitive motion representation but the serious loss of spatial information still limited their performance. Thus, to address this issue, our StreamMOS exploits a multi-view encoder to capture object motion from BEV and RV in a series manner, which not only allows for complete observation of objects but also alleviates computational effort. Meanwhile, we construct memory banks to pass past knowledge to current inference, resulting in consistent segmentation across a long sequence.

III METHODOLOGY

III-A Framework Overview

LiDAR-based MOS aims to determine the motion state of each point in the current scan based on the multi-frame point clouds $\{\mathcal{P}_{t-n}\}_{n=0}^{N}$ . To this end, existing methods first adopt the relative pose transformations $\{\mathcal{T}_{t-n\rightarrow t}\}_{n=1}^{N}$ provided by the LiDAR odometry to project the history scans $\{\mathcal{P}_{t-n}\}_{n=1}^{N}$ into ego car coordinate system of the current scan $\mathcal{P}_{t}$ and get $\{\mathcal{P}^{\prime}_{t-n}\}_{n=1}^{N}$ . Then, they usually feed $\mathcal{P}_{t}$ and $\{\mathcal{P}^{\prime}_{t-n}\}_{n=1}^{N}$ into a network $\Psi$ to fuse spatio-temporal information and predict classification results $\mathcal{M}_{t}\in\mathbb{R}^{V\times 3}$ of all points in $\mathcal{P}_{t}$ , where $V\times 3$ refers to the probability that $V$ points belong to three categories, including unknown, static and moving states.
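The alignment step described above can be sketched in a few lines of NumPy. This is only an illustration: the function names `transform_scan` and `align_history` are hypothetical, and each relative pose $\mathcal{T}_{t-n\rightarrow t}$ is assumed to be given as a 4×4 homogeneous matrix.

```python
import numpy as np

def transform_scan(points, pose):
    """Apply a 4x4 homogeneous transform to a (V, 3) array of points."""
    ones = np.ones((points.shape[0], 1), dtype=points.dtype)
    homogeneous = np.hstack([points, ones])          # (V, 4)
    return (homogeneous @ pose.T)[:, :3]             # back to (V, 3)

def align_history(history_scans, poses_to_current):
    """Project every history scan into the coordinate system of the current scan."""
    return [transform_scan(p, T) for p, T in zip(history_scans, poses_to_current)]
```

The same routine could also be applied to the coordinates attached to stored predictions before voting, since they need to be expressed in the current frame as well.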

Different from previous approaches that focus on temporal fusion in a single inference, we additionally consider the association among multiple inferences and leverage the history feature $\mathcal{H}_{t-1}$ and predictions $\{\mathcal{M}_{t-m}\}_{m=1}^{M}$ to raise the quality of the current inference. Thus, our method formulates the MOS task as follows:

$$\mathcal{M}_{t}=\Psi(\mathcal{P}_{t},\{\mathcal{P}^{\prime}_{t-n}\}_{n=1}^{N},\mathcal{H}_{t-1},\{\mathcal{M}_{t-m}\}_{m=1}^{M}) \tag{1}$$

where $N,M$ are the number of historical LiDAR frames and forecasts. Meanwhile, the details of our network are shown in Fig. 2. Specifically, given a series of scans, our StreamMOS first utilizes a multi-view encoder to capture the motion cues from the viewpoints of BEV and RV. Thereafter, we can get a motion feature $\mathcal{F}_{t}$ that reflects spatial information of moving objects in the current frame. Then, we use a temporal fusion module to combine $\mathcal{F}_{t}$ with historical feature $\mathcal{H}_{t-1}$ retained in short-term memory. By doing this, some prior information can be transferred to the current inference and further used to decode movable objects $\mathcal{O}_{t}$ as well as coarse motion state $\mathcal{C}_{t}$ for all points. Finally, we apply a voting mechanism to update $\mathcal{C}_{t}$ with historical results $\{\mathcal{M}_{t-m}\}_{m=1}^{M}$ stored in long-term memory and instance information derived from $\mathcal{O}_{t}$ , thereby yielding the refined prediction $\mathcal{M}_{t}$ .

III-B Multi-Projection Feature Encoder

III-B1 Preliminaries

Unlike existing methods that project point clouds into a single view, such as BEV [7] or RV [17], we believe that mapping points to both views could capture a more complete appearance and more obvious motion cues of dynamic objects. Meanwhile, as shown at the bottom of Fig. 2, the points can be considered as the intermediate carrier that transfers information between different perspectives. To achieve this, we use Point-to-BEV (P2B) and Point-to-Range (P2R) to project point features onto 2D planes, while using BEV-to-Point (B2P) and Range-to-Point (R2P) to gather point features from the views. Specifically, assuming that the $k^{th}$ 3D point in $\mathcal{P}_{t}$ is denoted as $p_{k}^{3D}=(x_{k},y_{k},z_{k})$ , P2B projects it into a rectangular 2D grid and obtains its coordinate $(u_{k}^{b},v_{k}^{b})$ in BEV. For P2R, the point $p_{k}^{3D}$ in 3D Cartesian coordinates is converted into the spherical coordinate $p_{k}^{sph}=(r_{k},\theta_{k},\phi_{k})$ and assigned to the 2D grid in RV with coordinate $(u_{k}^{r},v_{k}^{r})$ [18], where $r_{k}$ , $\theta_{k}$ , $\phi_{k}$ represent the distance, zenith angle and azimuth angle of point $p_{k}^{3D}$ . The features of points falling into the same grid cell are aggregated by max-pooling. For R2P and B2P, the grid features of RV and BEV are allocated to 3D points using bilinear interpolation within nearby grids.
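A minimal sketch of the P2B and P2R index computation and of the max-pooling aggregation follows. The grid sizes, the metric ranges and the vertical field of view are assumptions chosen for illustration (roughly matching a 64-beam sensor), not values stated in this section. The helper names are hypothetical, and the code works with the elevation angle, which equals $\pi/2$ minus the zenith angle $\theta_{k}$.

```python
import numpy as np

def point_to_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
                 width=512, height=512):
    """P2B: map (V, 3) points to integer BEV cell indices (u, v)."""
    u = (points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * width
    v = (points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * height
    u = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, height - 1).astype(np.int64)
    return u, v

def point_to_range(points, fov_up_deg=3.0, fov_down_deg=-25.0,
                   width=2048, height=64):
    """P2R: spherical projection of (V, 3) points to range-image indices."""
    r = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(points[:, 1], points[:, 0])            # in [-pi, pi]
    elevation = np.arcsin(points[:, 2] / np.maximum(r, 1e-8))   # pi/2 - zenith
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    u = 0.5 * (1.0 - azimuth / np.pi) * width
    v = (1.0 - (elevation - fov_down) / (fov_up - fov_down)) * height
    u = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, height - 1).astype(np.int64)
    return u, v

def scatter_max(features, u, v, width, height):
    """Max-pool (V, C) point features that fall into the same grid cell."""
    grid = np.full((height, width, features.shape[1]), -np.inf)
    np.maximum.at(grid, (v, u), features)    # unbuffered per-cell maximum
    grid[np.isneginf(grid)] = 0.0            # empty cells become zero
    return grid
```

The inverse mappings (B2P and R2P) would read the grids back at each point's continuous coordinate with bilinear interpolation instead of using the integer indices.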

III-B2 Network Structure

In the feature encoder, we first use a lightweight PointNet [19] as a point-wise encoder to process the point clouds ( $\mathcal{P}_{t},\{\mathcal{P}^{\prime}_{t-n}\}_{n=1}^{N}$ ) and obtain $\mathcal{E}_{n}\in\mathbb{R}^{V\times C}\ (n\in\{t-N,...,t\})$ , where $C$ is the number of channels. Then, we adopt P2B to project the features of each frame into BEV and concatenate them along the channel dimension to get the BEV feature $\mathcal{G}_{t}^{0}\in\mathbb{R}^{W^{b}\times H^{b}\times(N+1)C}$ , where $W^{b},H^{b}$ are the predefined width and height of BEV. Afterwards, we feed $\mathcal{G}_{t}^{0}$ into the multi-view encoder (MVE) to extract temporal information and capture object motion from different views.

In the lower part of Fig. 2, after downsampling BEV feature $\mathcal{G}_{t}^{l}(l\in\{0,1\})$ , we introduce an asymmetric convolution block (ACB) to perceive the movement of objects. As shown in Fig. 3(a), compared to the typical symmetric convolutional kernel (e.g. 3 $\times$ 3), the kernel size of ACB has one side longer (e.g. 3 $\times$ 5 and 5 $\times$ 3). Besides, it decouples feature extraction into the horizontal and vertical directions, defined as follows:

$$f^{\prime}=\text{Conv}_{3\times 3}(\text{Conv}_{h}(f)\odot\text{Conv}_{v}(f))+f \tag{2}$$

where $f$ and $\odot$ are the feature map and the concatenation operation. $\text{Conv}_{h}$ and $\text{Conv}_{v}$ denote asymmetric convolutions, which expand the receptive field and improve the perception of dynamic objects, since such objects usually have obvious motion in a certain direction. After that, as displayed in Fig. 2, we apply B2P and P2R to project the BEV feature $\mathcal{G}^{b}$ into the range view and use convolution layers to generate another motion feature $\mathcal{G}^{r}$ , which is further remapped into BEV and fused with $\mathcal{G}^{b}$ . Thus, complete motion cues can be encoded by cascade projection.
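To make the data flow of Eq. (2) concrete, here is a NumPy-only sketch. It assumes a channels-first layout, zero padding, and 3×5 and 5×3 kernels for the horizontal and vertical branches. Normalization and activation layers are left out because this section does not specify them, and the names `conv2d_same` and `asymmetric_conv_block` are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Zero-padded 'same' 2D cross-correlation.

    x: (C_in, H, W); kernel: (C_out, C_in, kh, kw) with odd kh and kw.
    """
    _, _, kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    # windows has shape (C_in, H, W, kh, kw)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (kh, kw), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", windows, kernel)

def asymmetric_conv_block(f, k_h, k_v, k_fuse):
    """Eq. (2): f' = Conv3x3(concat(Conv_h(f), Conv_v(f))) + f."""
    horizontal = conv2d_same(f, k_h)                         # e.g. a 3x5 kernel
    vertical = conv2d_same(f, k_v)                           # e.g. a 5x3 kernel
    merged = np.concatenate([horizontal, vertical], axis=0)  # channel concatenation
    return conv2d_same(merged, k_fuse) + f                   # residual connection
```

Because the two branches are concatenated, the fusing 3×3 kernel must accept twice as many input channels as each branch produces, and it must output as many channels as `f` so that the residual addition is valid.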

Thereafter, we stack two MVEs with a BEV-view encoder to get the discriminative motion feature $\mathcal{F}_{t}$ . Specifically, we visualize the multi-view features of different MVE layers in Fig. 3(b). The visualization shows that MVE can extract consistent object information across various perspectives, while the deeper layer is capable of suppressing noise and preserving clearer motion features.

III-C Short-term Temporal Fusion

The purpose of this part is to transfer the memory feature $\mathcal{H}_{t-1}$ from the last inference to the present one, so that the historical spatial states of objects can be reused to guide the network in deducing object motion at time $t$ . To this end, we first build a short-term memory bank as a bridge that stores $\mathcal{H}_{t-1}$ and connects adjacent inferences. Then, since $\mathcal{F}_{t}$ and $\mathcal{H}_{t-1}$ are not in the same coordinate system, we use learnable offsets [20] to adaptively find the relationship between the two features and combine them with attention weights. Specifically, $\mathcal{H}_{t-1}$ is first fed into two linear layers to produce $K$ attention weights $A_{k}$ and sampling offsets $\Delta g_{k}$ . Afterward, based on the offsets $\Delta g_{k}$ and the coordinates $g_{k}$ of reference points in $\mathcal{F}_{t}$ , bilinear interpolation is used to gather the reference values $G_{k}$ from $\mathcal{F}_{t}$ . Finally, $G_{k}$ is weighted by $A_{k}$ to get an updated feature ${\hat{\mathcal{H}}_{t}}$ . The above process can be formulated as follows:

$$A_{k}=\text{Softmax}(\text{Linear}(\mathcal{H}_{t-1})),\ \Delta g_{k}=\text{Linear}(\mathcal{H}_{t-1}) \tag{3}$$

$$G_{k}=S(\mathcal{F}_{t},g_{k}+\Delta g_{k}) \tag{4}$$

$${\hat{\mathcal{H}}_{t}}=\sum_{l=1}^{L}W_{l}(\sum_{k=1}^{K}A_{lk}\cdot G_{lk}) \tag{5}$$

where $L,K$ are the number of attention heads and reference points, respectively. $S(\cdot)$ and $W_{l}$ represent the bilinear sampling and the learnable weight of multi-head attention. Later, ${\hat{\mathcal{H}}_{t}}$ is processed by a normalization layer and a feed-forward network (FFN) to generate a renewed $\mathcal{H}_{t}$ at the current time:

$$\tilde{\mathcal{H}}_{t}=\text{LN}(\hat{\mathcal{H}}_{t}+\mathcal{H}_{t-1}),\ \mathcal{H}_{t}=\text{LN}(\text{FFN}(\tilde{\mathcal{H}}_{t})+\tilde{\mathcal{H}}_{t}) \tag{6}$$

Here, LN is the layer normalization. And the $\mathcal{H}_{t}$ will replace $\mathcal{H}_{t-1}$ of short-term memory and participate in the prediction of segmentation decoder.
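Eqs. (3) to (6) can be traced with the single-head NumPy sketch below ( $L=1$ ). Several details are assumptions made only for illustration: the memory feature is flattened to one query per BEV cell, offsets are expressed in pixel units, out-of-range samples are clamped to the border, layer normalization has no learnable affine parameters, and the FFN is a two-layer perceptron with ReLU. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def bilinear_sample(feat, xy):
    """Sample feat (H, W, C) at continuous (x, y) locations given as (..., 2)."""
    h, w, _ = feat.shape
    x = np.clip(xy[..., 0], 0, w - 1)
    y = np.clip(xy[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def temporal_fusion(h_prev, f_t, ref_xy, w_attn, w_off, w_out, w_ffn1, w_ffn2):
    """h_prev: (Q, C) memory; f_t: (H, W, C) current feature; ref_xy: (Q, 2)."""
    q = h_prev.shape[0]
    k = w_attn.shape[1]
    attn = softmax(h_prev @ w_attn, axis=-1)                      # Eq. (3), (Q, K)
    offsets = (h_prev @ w_off).reshape(q, k, 2)                   # Eq. (3), (Q, K, 2)
    sampled = bilinear_sample(f_t, ref_xy[:, None, :] + offsets)  # Eq. (4), (Q, K, C)
    h_hat = (attn[..., None] * sampled).sum(axis=1) @ w_out       # Eq. (5), one head
    h_tilde = layer_norm(h_hat + h_prev)                          # Eq. (6), first LN
    ffn = np.maximum(h_tilde @ w_ffn1, 0.0) @ w_ffn2              # ReLU FFN (assumed)
    return layer_norm(ffn + h_tilde)                              # Eq. (6), second LN
```

Note that the attention weights and offsets are predicted from the memory feature alone, so each query only reads $K$ locations of $\mathcal{F}_{t}$ rather than the whole map.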

III-D Reduced-Parameter Segmentation Decoder

To distinguish static and dynamic points, previous methods [7, 5, 8] usually leverage a UNet-like decoder to upsample multi-scale features progressively by convolutions. However, to reduce the complexity and storage costs of the network, we introduce a lightweight decoder, which first employs bilinear interpolation to resize the multi-scale features $\mathcal{G}_{t}^{i}\ (i\in\{1,2\})$ and $\mathcal{H}_{t}$ to the uniform height $H^{b}/2$ and width $W^{b}/2$ . For each upsampled feature, we employ an auxiliary head to predict moving objects in BEV and apply an auxiliary loss as a constraint, which helps ensure that the multi-scale features are aligned and decoded well. Thereafter, we use B2P to convert the upsampled features into point features one by one and concatenate them to get $\mathcal{\hat{E}}_{t}\in\mathbb{R}^{V\times D}$ . Finally, besides decoding the coarse motion states $\mathcal{C}_{t}\in\mathbb{R}^{V\times 3}$ of the LiDAR points $\mathcal{P}_{t}$ , the point-wise decoder additionally predicts the probability $\mathcal{O}_{t}\in\mathbb{R}^{V\times 2}$ that points belong to movable objects (e.g. cars, bicycles) or to the static background, such as buildings, parking areas and roads:

$$\mathcal{C}_{t}=\text{MLP}_{1}(\mathcal{E}_{t}\odot\mathcal{\hat{E}}_{t}),\ \mathcal{O}_{t}=\text{MLP}_{2}(\mathcal{E}_{t}\odot\mathcal{\hat{E}}_{t}) \tag{7}$$

where $\text{MLP}_{1},\text{MLP}_{2}$ mean multi-layer perceptrons that do not share weights, and $\mathcal{E}_{t}$ is the feature from point-wise encoder. According to discrete classification labels $\mathcal{O}_{t}$ , we can acquire the attributes of instance, like location and size, by clustering and apply them to optimize $\mathcal{C}_{t}$ in the subsequent voting stage.

III-E Long-term Voting Mechanism

Most existing approaches [4, 5, 17] focus on improving the quality of a single inference by modifying the network structure. Nevertheless, in light of the limited interpretability and the data dependency of neural networks, this strategy may be limited. For example, for the parked car shown in Fig. 1(c), the model may predict it as stationary in one frame and dynamic in other frames. Meanwhile, due to the lack of instance-level perception, the network may generate inconsistent results for different parts of an object. To solve these issues, we present a voting mechanism consisting of voxel-based voting (VBV) and instance-based voting (IBV), which can be regarded as post-processing that corrects errors in the current predicted labels $\mathcal{C}_{t}$ using historical results $\{\mathcal{M}_{t-m}\}^{M}_{m=1}$ and movable labels $\mathcal{O}_{t}$ . Note that $\mathcal{M}_{t-m}\ (m=1,...,M)$ , expressed in the coordinates of $\mathcal{P}_{t-m}$ , is transformed in advance into the coordinate system of $\mathcal{P}_{t}$ by the pose transformations $\mathcal{T}_{t-m\rightarrow t}$ to yield $\mathcal{M}^{\prime}_{t-m}$ .

III-E1 Voxel-based voting

Inspired by TFNet [21], we obtain the historical predictions $\{\mathcal{M}^{\prime}_{t-m}\}^{M}_{m=1}$ and the current forecast $\mathcal{C}_{t}$ in the same coordinate system. Afterward, we divide the points $\mathcal{P}_{t}$ into voxels with a fixed size and fill $(\mathcal{C}_{t},\{\mathcal{M}^{\prime}_{t-m}\}^{M}_{m=1})$ into each voxel. Then, as shown in Fig. 4(a), the most frequently predicted label is taken as the motion state for all points in the same voxel, and incorrect labels are updated. We summarize the above procedure of VBV as: $\Omega(\mathcal{C}_{t},\{\mathcal{M}^{\prime}_{t-m}\}^{M}_{m=1})\mapsto\hat{\mathcal{C}_{t}}$ .
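The voxel-level majority vote can be sketched as follows. The voxel size, the integer label convention (0 unknown, 1 static, 2 moving) and the tie-breaking rule are assumptions for illustration, the function name is hypothetical, and history points are expected to be already transformed into the current coordinate system.

```python
import numpy as np

def voxel_based_voting(cur_points, cur_labels, hist_points, hist_labels,
                       voxel_size=0.2, num_classes=3):
    """Return a majority-voted label for every current point.

    cur_points: (V, 3); cur_labels: (V,) integer labels.
    hist_points / hist_labels: lists of (V_m, 3) and (V_m,) arrays.
    """
    all_points = np.concatenate([cur_points] + list(hist_points))
    all_labels = np.concatenate([cur_labels] + list(hist_labels)).astype(np.int64)
    voxels = np.floor(all_points / voxel_size).astype(np.int64)
    # Give every occupied voxel a compact integer id.
    _, voxel_ids = np.unique(voxels, axis=0, return_inverse=True)
    voxel_ids = voxel_ids.reshape(-1)
    # Histogram of labels per voxel, over current and past predictions.
    counts = np.zeros((voxel_ids.max() + 1, num_classes), dtype=np.int64)
    np.add.at(counts, (voxel_ids, all_labels), 1)
    winners = counts.argmax(axis=1)      # ties resolve to the lowest label id
    return winners[voxel_ids[: len(cur_points)]]
```

Only the labels of the current points are returned, while the history contributes votes but is not relabeled.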

III-E2 Instance-based voting

Although VBV can ensure the consistency of motion states within a local area and improves performance in Tab. IV, it struggles to unify the state of an entire instance, as shown by the VBV output $\hat{\mathcal{C}_{t}}$ in Fig. 4(b). To solve this, we propose an instance-based voting based on clustering. Specifically, given the predicted probability $\mathcal{O}_{t}$ from the decoder, we pick out the foreground points ${\mathcal{P}^{\prime}_{t}}$ from ${\mathcal{P}_{t}}$ and adopt DBSCAN [22] to split ${\mathcal{P}^{\prime}_{t}}$ into $S$ clusters. Then, according to the coordinates of the points in each cluster, we compute $S$ minimum 3D bounding boxes to cover all objects. Thus, we can further crop out instance-level predictions from $\hat{\mathcal{C}_{t}}$ and the memory predictions $\{\mathcal{M}^{\prime}_{t-m}\}^{M}_{m=1}$ . Finally, similar to voxel-based voting, we adopt the class label with the highest count as the motion state for all points in the instance and get the final prediction as: $\Phi(\hat{\mathcal{C}_{t}},\{\mathcal{M}^{\prime}_{t-m}\}^{M}_{m=1},\mathcal{O}_{t})\mapsto{\mathcal{M}_{t}}$ .
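A sketch of the instance-level vote is given below. To stay dependency-free it takes cluster ids as an input (for example the output of a DBSCAN implementation, with -1 marking background or noise) instead of running the clustering itself. It also assumes axis-aligned bounding boxes, which is one possible reading of "minimum 3D bounding boxes". The function name is hypothetical.

```python
import numpy as np

def instance_based_voting(cur_points, vbv_labels, cluster_ids,
                          hist_points, hist_labels, num_classes=3):
    """Assign one motion label per clustered instance by majority vote.

    cur_points: (V, 3); vbv_labels: (V,) labels after voxel-based voting.
    cluster_ids: (V,) instance id per point, -1 for points outside any instance.
    hist_points / hist_labels: lists of past points (current frame) and labels.
    """
    refined = vbv_labels.copy()
    hist_pts = np.concatenate([np.empty((0, 3))] + list(hist_points))
    hist_lbl = np.concatenate([np.empty(0, dtype=np.int64)] + list(hist_labels))
    for instance in np.unique(cluster_ids):
        if instance < 0:
            continue
        mask = cluster_ids == instance
        lo = cur_points[mask].min(axis=0)      # axis-aligned box (assumed)
        hi = cur_points[mask].max(axis=0)
        inside = np.all((hist_pts >= lo) & (hist_pts <= hi), axis=1)
        votes = np.concatenate([vbv_labels[mask], hist_lbl[inside]]).astype(np.int64)
        refined[mask] = np.bincount(votes, minlength=num_classes).argmax()
    return refined
```

Points that belong to no cluster keep their voxel-voted label, so the instance stage only overrides labels inside detected movable objects.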

III-E3 Long-term memory updating

We construct a memory bank with the length of $M$ to retain historical results. When a new refined prediction $\mathcal{M}_{t}$ is output from voting mechanism, we append it to memory bank and pull out the oldest result.
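This first-in, first-out update maps naturally onto a bounded queue. The class below is a minimal sketch with a hypothetical name, using Python's `collections.deque`, which discards the oldest element automatically once `maxlen` is reached.

```python
from collections import deque

class LongTermMemory:
    """Fixed-length bank that keeps the M most recent refined predictions."""

    def __init__(self, length):
        self.bank = deque(maxlen=length)

    def update(self, prediction):
        # Appending to a full deque silently drops the oldest entry.
        self.bank.append(prediction)

    def history(self):
        # Oldest first, newest last.
        return list(self.bank)
```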

As a result, compared to relying only on the adaptive learning of the network, our voting mechanism can explicitly suppress incorrect predictions and improve the consistency of segmentation by analyzing long-term predictions at the voxel and instance levels.

III-F Loss Functions

To ensure the network can be optimized fully, we separate the training process into two phases. In the $1^{st}$ stage, we only train our network without predicting movable objects in the decoder. Meanwhile, following the previous works [7, 5], we introduce the weighted cross-entropy ( $L_{wce}$ ) and Lovász-Softmax ( $L_{ls}$ ) [23] losses to supervise network:

$$L=\lambda_{1}L_{wce}+\lambda_{2}L_{ls},\ L_{s1}=L(y,\hat{y})+\lambda_{3}\sum_{i=1}^{3}L(y_{i}^{b},\hat{y}_{i}^{b}) \tag{8}$$

where $\lambda_{1}$ , $\lambda_{2}$ , $\lambda_{3}$ mean the weights for losses while $y$ and $\hat{y}$ are the ground truth and predicted results of points. $L(y_{i}^{b},\hat{y}_{i}^{b})$ denote the auxiliary losses for BEV predictions. Moreover, in the $2^{nd}$ stage, we freeze the pre-trained parameters that are optimized in the $1^{st}$ stage and only train the rest of network to estimate movable objects by loss function as follows:

$$L_{s2}=\lambda_{1}L_{wce}(x,\hat{x})+\lambda_{2}L_{ls}(x,\hat{x}) \tag{9}$$

where $x$ and $\hat{x}$ represent ground-truth labels and predictions for movable objects.
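As an illustration of the $L_{wce}$ term only, the sketch below computes a class-weighted cross-entropy from raw logits in NumPy. The normalization by the sum of per-sample weights is an assumption (it mirrors a common deep learning convention), the function name is hypothetical, and the Lovász-Softmax term is not reproduced here.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy.

    logits: (V, K) raw scores; labels: (V,) integer classes;
    class_weights: (K,) non-negative weight per class.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]           # per-point loss
    weights = class_weights[labels]
    return (weights * nll).sum() / weights.sum()               # weighted mean
```

Larger weights on the rare moving class would make mistakes on moving points cost more than mistakes on the dominant static class.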

IV EXPERIMENTS

IV-A Experimental Settings

Datasets. On SemanticKITTI-MOS [4] dataset and Sipailou-Campus [7] dataset, we compare segmentation performance with previous methods and conduct extensive ablation studies. The SemanticKITTI-MOS dataset is collected by Velodyne HDL-64E LiDAR and contains a total of 22 sequences with labeled point clouds that are remapped from 28 semantic classes into 3 types of motion states. Following the previous algorithms [7, 5, 8], we divide the sequences 00-07, 09-10 (19,130 frames) for training, sequence 08 (4,071 frames) for validation and sequences 11-21 (20,351 frames) for testing. For Sipailou-Campus that is constructed based on solid-state LiDAR, we follow the implementation of [7] to split dataset into 5 train sequences (16,887 frames), 1 validation sequence (3,191 frames) and 2 test sequences (6,201 frames).

Evaluation Metric. Consistent with present approaches [8, 5], we employ the Jaccard Index or Intersection-over-Union (IoU) metric [24] over dynamic objects to measure the MOS performance, which can be denoted as:

$$\text{IoU}=\frac{\text{TP}}{\text{TP}+\text{FP}+\text{FN}} \tag{10}$$

where TP, FP, and FN mean the number of true positive, false positive, and false negative predictions for dynamic category.
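Eq. (10) can be evaluated directly from point-wise labels, as in the sketch below. The label ids (0 unknown, 1 static, 2 moving) and the choice to skip points whose ground truth is unknown are assumptions for illustration, and the function name is hypothetical.

```python
import numpy as np

def moving_iou(pred, gt, moving_label=2, ignore_label=0):
    """IoU of the moving class, ignoring points with unknown ground truth."""
    valid = gt != ignore_label
    pred_moving = pred[valid] == moving_label
    gt_moving = gt[valid] == moving_label
    tp = np.sum(pred_moving & gt_moving)
    fp = np.sum(pred_moving & ~gt_moving)
    fn = np.sum(~pred_moving & gt_moving)
    denominator = tp + fp + fn
    return tp / denominator if denominator > 0 else float("nan")
```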

IV-B Implementation Details

In data processing, we adopt widely-used data augmentation, such as random rotation, flipping and slight translation to enrich the training data, which plays an important role in improving model generalization. Meanwhile, as mentioned in Sec. III-F, we optimize the network using two-stage training strategy. For the $1^{st}$ stage, we train the model for 48 epochs on NVIDIA RTX 4090 GPUs using an SGD optimizer with an initial learning rate of 0.02, which is decayed by 0.1 every 10 epochs. For the $2^{nd}$ stage, we solely optimize the network for 10 epochs with a learning rate of 0.02. Furthermore, each LiDAR scan is limited to [-50 $m$ , 50 $m$ ] for the X and Y axes and [-4 $m$ , 2 $m$ ] for the Z axis. The number of points in each scan is randomly downsampled or filled to $V$ = 1.3 $\times$ 10⁵.
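The range cropping, the fixed-size resampling and the step decay of the learning rate can be sketched as follows. Zero padding is an assumption, since the text only says that scans are "filled", and reading "decayed by 0.1" as multiplying by 0.1 is also an assumption. The function names are hypothetical.

```python
import numpy as np

def crop_and_resample(points, num_points=130000, xy_limit=50.0,
                      z_range=(-4.0, 2.0), rng=None):
    """Keep points inside the region of interest and return exactly num_points rows."""
    rng = np.random.default_rng() if rng is None else rng
    keep = (
        (np.abs(points[:, 0]) <= xy_limit)
        & (np.abs(points[:, 1]) <= xy_limit)
        & (points[:, 2] >= z_range[0])
        & (points[:, 2] <= z_range[1])
    )
    points = points[keep]
    n = points.shape[0]
    if n >= num_points:
        # Random downsampling without replacement.
        return points[rng.choice(n, num_points, replace=False)]
    # Too few points: pad with zeros (assumed filling strategy).
    pad = np.zeros((num_points - n, points.shape[1]), dtype=points.dtype)
    return np.concatenate([points, pad])

def step_lr(epoch, base_lr=0.02, gamma=0.1, step=10):
    """Learning rate multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```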

IV-C Quantitative Results

TABLE I: Performance comparison on SemanticKITTI validation and test sets. * denotes the method exploiting semantic labels. † means the method is trained both on SemanticKITTI and KITTI-road datasets.

Methods	Source	IoU (Val)	IoU (Test)
KPConv	ICCV 19	-	60.9
SpSequenceNet	CVPR 20	-	43.2
LiMoSeg	arXiv 21	52.6	-
LMNet	RA-L 21	66.4	58.3
Cylinder3D	CVPR 21	66.3	61.2
AutoMOS	RA-L 22	-	54.3
MotionSeg3D, v1	IROS 22	68.1	62.5
MotionSeg3D, v2	IROS 22	71.4	64.9
4DMOS	RA-L 22	77.2	65.2
MotionBEV, w/o delay	RA-L 23	68.1	63.9
MotionBEV, w/ delay	RA-L 23	76.5	69.7
StreamMOS-V	-	78.3	73.1
LMNet*	RA-L 21	67.1	62.5
RVMOS*	RA-L 22	71.2	74.7
InsMOS*	IROS 23	73.2	70.6
InsMOS*†	IROS 23	-	75.6
MF-MOS*	ICRA 24	76.1	76.7
StreamMOS-VI*	-	81.6	77.8

Comparison with previous methods. We first evaluate our StreamMOS on the SemanticKITTI-MOS benchmark. To ensure fairness, our method is split into two versions in Tab. I, which keeps the settings as consistent as possible with previous works. Specifically, (a) StreamMOS-V indicates the network that is trained in the $1^{st}$ stage and only uses voxel-based voting as post-processing. (b) StreamMOS-VI* additionally performs the $2^{nd}$ stage of training and uses instance-based voting, which relies on movable object predictions. Then, our methods are compared with existing algorithms, which are grouped by whether semantic annotations are utilized. Specifically, the v1 and v2 of MotionSeg3D [5] refer to using kNN or the point refine module as post-processing, respectively. Moreover, “w/ delay” signifies exploiting point cloud frames within the time window of $[t,t+N]$ to estimate dynamic objects in frame $t$ . Note that, following MotionBEV [7], our results shown in Tab. I are obtained by training on the original SemanticKITTI without any additional data.

TABLE II: Performance comparison on Sipailou-Campus dataset.

Methods	Source	IoU (Val)	IoU (Test)
LMNet	RA-L 21	54.3	56.2
MotionSeg3D, v2	IROS 22	65.6	66.8
4DMOS	RA-L 22	87.3	88.9
MotionBEV	RA-L 23	89.2	90.8
StreamMOS-V	-	90.9	92.5

TABLE III: Comparison of running time (ms) with previous methods.

Methods	Time (ms)
4DMOS	86
MF-MOS*	96
MotionSeg3D, v1	42
MotionSeg3D, v2	117
InsMOS*	120
RVMOS*	29
StreamMOS-V	62
StreamMOS-VI*	96

As illustrated in Tab. I, our streaming method outperforms previous works in most cases. Specifically, our StreamMOS-V exceeds 4DMOS [9] by 1.1% and 7.9% in validation and test. We think that compared with using binary Bayes filter to merge historical results in 4DMOS, our method additionally considers historical feature from the last inference, which can serve as strong spatial priors to improve prediction quality. At the same time, our StreamMOS-VI* surpasses InsMOS* [8] and MF-MOS* [25] in the validation set significantly ( $\uparrow$ 8.4% and $\uparrow$ 5.5%) by instance-based voting. Finally, due to the lack of semantic annotation in Sipailou-Campus dataset, we solely list [4, 7, 5, 9] in Tab. II and confirm the effectiveness of StreamMOS-V, even using a solid-state LiDAR with the narrow field of view and non-repetitive scanning patterns.

Inference Speed. Although our method uses attention mechanism to construct feature association between inferences and merge multiple historical predictions in voting mechanism, it still keeps competitive running time compared with previous approaches in Tab. III. We believe this is the contribution of projection-based backbone, lightweight deformable attention and parameter-free upsampling in decoder, which make our method strike a balance between speed and performance.

TABLE IV: The effect of different modules in SemanticKITTI validation.

	TF	MVE	VBV	IBV	IoU %	$\Delta$
A1					67.1	-
A2	✓				73.2	+6.1
A3	✓	✓			77.1	+10.0
A4	✓	✓	✓		78.3	+11.2
A5	✓	✓		✓	81.3	+14.2
A6	✓	✓	✓	✓	81.6	+14.5

TABLE V: Ablation experiment on multi-view encoder of StreamMOS-V.

	RV	BEV	ACB	Parallel	Series	IoU [%]
B1	✓					70.3
B2		✓				74.2
B3	✓	✓			✓	77.5
B4	✓	✓		✓		74.8
B5	✓	✓	✓		✓	78.3

TABLE VI: Ablation experiment on temporal fusion of StreamMOS-V.

	Strategy	IoU [%]	$\Delta$
C1	w/o Temporal Fusion	72.1	-6.2
C2	Cross-attention	73.0	-5.3
C3	Concatenation	74.8	-3.5
C4	Addition	75.6	-2.7
C5	Deform-attention	78.3	-

IV-D Qualitative Analysis

To compare with previous methods intuitively, we visualize segmentation results in various scenarios. As demonstrated in Fig. 5, LMNet* suffers from boundary-blurring issues in the $4^{th}$ row. Although MotionSeg3D adopts a point refinement module to alleviate this problem, it still makes mistakes when dealing with distant objects in the $2^{nd}$ and $5^{th}$ rows. Moreover, due to the lack of instance-level sensing, MotionSeg3D is prone to incomplete predictions, such as in the $1^{st}$ row. Although adding instance detection, as in InsMOS*, can improve the integrity of segmentation, it aggravates the negative impact when the prediction is incorrect, as illustrated in the $2^{nd}$ , $3^{rd}$ and $5^{th}$ rows. Unlike the above algorithms, our StreamMOS-VI* first combines observations from multiple views to improve the perception of objects at different distances. Then, instead of focusing on a single inference, we build relationships across several inferences to integrate the historical feature and predictions, which are used to suppress false results and improve segmentation integrity. Thus, our method achieves superior performance in Fig. 5.

IV-E Ablation Study

In this section, we conduct comprehensive experiments on the SemanticKITTI validation set to confirm the effectiveness of the proposed modules.

Model Components. As shown in Tab. IV, our StreamMOS includes four crucial modules: temporal fusion (TF), multi-view encoder (MVE), voxel-based voting (VBV), and instance-based voting (IBV). To understand their importance to the overall performance, we first remove all of these modules from StreamMOS and regard the remainder as the baseline (A1). After building feature correlations between inferences with TF, the IoU increases by 6.1% in A2. Moreover, by capturing multi-view motion cues from BEV and RV, MVE brings a further improvement of 3.9% in A3. Then, thanks to object-level perception, instance-based voting in A5 (81.3%) outperforms voxel-based voting in A4 (78.3%), which only focuses on limited areas within the 3D cube. Finally, we achieve the best performance (81.6%) in A6 by combining them into a refinement procedure from voxel to instance, which indicates that effectively utilizing long-term predictions is a key element of the LiDAR MOS task.
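
The incremental gains discussed above follow directly from the IoU column of Tab. IV. The short sketch below only reproduces this arithmetic; the dictionary and the helper `gain` are hypothetical names introduced for illustration.

```python
# IoU [%] values copied from Tab. IV (SemanticKITTI validation set).
ABLATION_IOU = {
    "A1": 67.1,  # baseline without TF, MVE, VBV, IBV
    "A2": 73.2,  # + TF
    "A3": 77.1,  # + TF + MVE
    "A4": 78.3,  # + TF + MVE + VBV
    "A5": 81.3,  # + TF + MVE + IBV
    "A6": 81.6,  # + TF + MVE + VBV + IBV
}


def gain(config: str, reference: str = "A1") -> float:
    """IoU difference in percentage points between two configurations."""
    return round(ABLATION_IOU[config] - ABLATION_IOU[reference], 1)


# Gain of each configuration over the one it extends.
for config, reference in [("A2", "A1"), ("A3", "A2"), ("A4", "A3"),
                          ("A5", "A3"), ("A6", "A3")]:
    print(f"{config} vs {reference}: {gain(config, reference):+.1f}")
```

Measured against A3, voxel-based voting alone adds 1.2 points, instance-based voting alone adds 4.2 points, and the combination adds 4.5 points.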

Multi-view Encoder. We compare several multi-view encoding strategies in Tab. V. From B1 and B2, we observe that when object motion is encoded in a single view only, the BEV representation achieves better results than RV owing to its global perspective and motion consistency. Then, we divide the encoder into BEV and RV branches and extract multi-view features in series (B3), which further increases the IoU and exceeds the parallel mode (B4) by 2.7%. We think that the series manner may be more suitable for deriving consistent motion features from different views owing to its progressive encoding. Furthermore, using the asymmetric convolution block (ACB) yields a 0.8% improvement in B5, showing the advantage of decoupling horizontal and vertical encoding.
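
The idea of decoupling horizontal and vertical encoding can be illustrated with a minimal sketch. It assumes a generic asymmetric block that applies a 1×k kernel along rows and a k×1 kernel along columns and sums the two responses; this is an assumption for illustration, not the exact layer layout of the ACB, and all function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def conv_horizontal(x: Grid, k: List[float]) -> Grid:
    """1 x len(k) cross-correlation along each row (zero padding, same size)."""
    r, h, w = len(k) // 2, len(x), len(x[0])
    return [[sum(k[j] * x[i][c + j - r]
                 for j in range(len(k)) if 0 <= c + j - r < w)
             for c in range(w)] for i in range(h)]


def conv_vertical(x: Grid, k: List[float]) -> Grid:
    """len(k) x 1 cross-correlation along each column (zero padding, same size)."""
    r, h, w = len(k) // 2, len(x), len(x[0])
    return [[sum(k[j] * x[i + j - r][c]
                 for j in range(len(k)) if 0 <= i + j - r < h)
             for c in range(w)] for i in range(h)]


def asymmetric_block(x: Grid, k_h: List[float], k_v: List[float]) -> Grid:
    """Sum of a horizontal-only and a vertical-only response (assumed layout)."""
    a, b = conv_horizontal(x, k_h), conv_vertical(x, k_v)
    return [[a[i][c] + b[i][c] for c in range(len(x[0]))]
            for i in range(len(x))]


# A single active cell spreads along its row and its column only,
# so the block has a cross-shaped receptive field.
impulse = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(asymmetric_block(impulse, [1, 2, 3], [10, 20, 30]))
```

With kernel size k, such a block uses 2k weights per channel pair instead of the k² weights of a square kernel, and the corner cells of the impulse response stay at zero.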

Temporal Fusion. The strategy for propagating historical features into the current inference affects segmentation quality, as demonstrated in Tab. VI. First, we observe that removing temporal fusion, and with it the prior information, leads to the lowest result ($\downarrow$ 6.2%). Then, compared with directly applying concatenation or addition to merge features in different coordinate systems, deformable attention aligns features adaptively through learnable offsets and gains 3.5% and 2.7% IoU, respectively. Moreover, it is worth noting that cross-attention performs worst among the fusion strategies, since redundant global attention may have a negative effect. In contrast, deformable attention concentrates on local features, which avoids overfitting and reduces the computational load.
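
The difference between global and deformable attention can be made concrete with a minimal sketch of the sampling step. It assumes a single-channel 2D feature map and takes the offsets and attention weights as given inputs, whereas in a real deformable attention layer they are predicted from the query by learned layers; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def bilinear_sample(feat: List[List[float]], x: float, y: float) -> float:
    """Bilinearly interpolate feat at (x, y); out-of-bounds cells count as 0."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wx * wy * feat[yy][xx]
    return value


def deformable_sample(feat: List[List[float]],
                      ref: Tuple[float, float],
                      offsets: Sequence[Tuple[float, float]],
                      weights: Sequence[float]) -> float:
    """Weighted sum of a few samples taken at ref + offset in the old feature map."""
    rx, ry = ref
    return sum(a * bilinear_sample(feat, rx + dx, ry + dy)
               for (dx, dy), a in zip(offsets, weights))


# Historical feature map; the query at (0, 0) looks at two shifted locations.
history = [[0.0, 1.0], [2.0, 3.0]]
print(deformable_sample(history, (0.0, 0.0), [(1.0, 0.0), (0.0, 1.0)], [0.5, 0.5]))
```

Each query reads only a handful of sampled locations instead of every position in the historical map, which is where the saving in computation over global cross-attention comes from.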

Time Window Length. The time window length determines how far back predictions can be used by the voting mechanism. Thus, we conduct experiments on the time window length to choose the optimal setting for our algorithm. As displayed in the time window length figure, the performance increases rapidly until the length $M$ of the time window reaches 8. Although further increasing the length could yield a slight improvement, it costs more time. Thus, we opt for $M = 8$ as our default.
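
The role of the window length $M$ can be illustrated with a minimal sketch of voting over a sliding window. It assumes that per-voxel predictions from different inferences are already expressed in a common coordinate frame and keyed by a voxel identifier, and it uses a plain majority vote; the class `VotingWindow` is hypothetical and much simpler than the voxel-to-instance refinement evaluated here.

```python
from collections import Counter, deque
from typing import Deque, Dict, Hashable

STATIC, MOVING = 0, 1


class VotingWindow:
    """Keeps the last M per-voxel predictions and refines the newest one."""

    def __init__(self, window_length: int = 8) -> None:
        # deque(maxlen=M) silently drops predictions older than M inferences.
        self.history: Deque[Dict[Hashable, int]] = deque(maxlen=window_length)

    def update(self, prediction: Dict[Hashable, int]) -> Dict[Hashable, int]:
        """Store the current prediction and return its majority-voted version."""
        self.history.append(dict(prediction))
        refined = {}
        for voxel, label in prediction.items():
            votes = Counter(frame[voxel] for frame in self.history if voxel in frame)
            ranked = votes.most_common()
            if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
                refined[voxel] = label  # tie: keep the current prediction
            else:
                refined[voxel] = ranked[0][0]
        return refined


window = VotingWindow(window_length=3)
for current in (MOVING, MOVING, STATIC, STATIC):
    print(window.update({"voxel_a": current}))
```

In this toy run, the single `STATIC` prediction at the third step is outvoted by the two earlier `MOVING` predictions, while at the fourth step the oldest prediction has left the window and the label switches. A longer window suppresses more isolated errors but stores and scans more history, which matches the trade-off described above.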

Other Hyper-parameter Settings. In Fig. 7, we explore the impact of frame number and BEV resolution on performance. We observe that the optimal BEV size $(W^{b},H^{b})$ is $512\times 512$. A BEV resolution that is too small prevents the network from capturing the motion of small objects, while an excessively large resolution makes it sensitive to slight disturbances. Besides, a larger BEV image contains more empty grids, which may dilute useful information.
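
The effect of BEV resolution on empty grids can be checked with a small sketch. It assumes a square BEV image covering a hypothetical range of ±50 m around the sensor and a hand-made list of points; neither the range nor the function `bev_occupancy` is taken from our implementation.

```python
from typing import Iterable, Tuple


def bev_occupancy(points: Iterable[Tuple[float, float]],
                  size: int,
                  max_range: float = 50.0) -> Tuple[float, float]:
    """Return (cell size in metres, fraction of empty cells) for a size x size BEV grid."""
    cell = 2.0 * max_range / size
    occupied = set()
    for x, y in points:
        if -max_range <= x < max_range and -max_range <= y < max_range:
            col = min(size - 1, int((x + max_range) / cell))
            row = min(size - 1, int((y + max_range) / cell))
            occupied.add((row, col))
    return cell, 1.0 - len(occupied) / (size * size)


# Hypothetical points (x, y) in metres; two of them are only 14 cm apart.
points = [(0.0, 0.0), (0.1, 0.1), (10.0, -5.0)]
for size in (4, 512):
    cell, empty = bev_occupancy(points, size)
    print(f"{size}x{size}: cell = {cell:.3f} m, empty fraction = {empty:.6f}")
```

For a fixed point cloud, a finer grid shrinks each cell, so small displacements become visible, but the share of empty cells grows towards one.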

Furthermore, as shown in Fig. 7(b), compared to previous approaches [8, 9] that require many frames to extract spatio-temporal features, our method relies on only 3 frames to achieve the best result. We think this is due to the effective reuse of historical features and predictions in temporal fusion and voting, which brings rich prior knowledge to the network. Meanwhile, feeding too many frames into the network would cause information redundancy and degrade performance.

V Conclusion

In this paper, we analyse the limitations of existing MOS methods and propose a novel streaming structure, which uses a memory bank as a bridge to transfer prior information among inferences. Moreover, our StreamMOS captures the complete appearance and motion features of objects from multiple views. To correct false predictions in a single inference, we propose a voting mechanism that integrates historical predictions at the voxel and instance levels. Extensive experimental results show the effectiveness of the proposed modules and demonstrate that our method achieves competitive performance in diverse aspects.

References

[1] J. Zhang and S. Singh, “Loam: Lidar odometry and mapping in real-time,” in Robotics: Science and Systems, vol. 2, pp. 1–9, Berkeley, CA, 2014.
[2] B. Guo, N. Guo, and Z. Cen, “Obstacle avoidance with dynamic avoidance risk region for mobile robots in dynamic environments,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 5850–5857, 2022.
[3] P. Chen, J. Pei, W. Lu, and M. Li, “A deep reinforcement learning based method for real-time path planning and dynamic obstacle avoidance,” Neurocomputing, vol. 497, pp. 64–75, 2022.
[4] X. Chen, S. Li, B. Mersch, L. Wiesmann, J. Gall, J. Behley, and C. Stachniss, “Moving object segmentation in 3d lidar data: A learning-based approach exploiting sequential data,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 6529–6536, 2021.
[5] J. Sun, Y. Dai, X. Zhang, J. Xu, R. Ai, W. Gu, and X. Chen, “Efficient spatial-temporal information fusion for lidar-based 3d moving object segmentation,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 11456–11463, IEEE, 2022.
[6] S. Mohapatra, M. Hodaei, S. Yogamani, S. Milz, H. Gotzig, M. Simon, H. Rashed, and P. Maeder, “Limoseg: Real-time bird’s eye view based lidar motion segmentation,” arXiv preprint arXiv:2111.04875, 2021.
[7] B. Zhou, J. Xie, Y. Pan, J. Wu, and C. Lu, “Motionbev: Attention-aware online lidar moving object segmentation with bird’s eye view based appearance and motion features,” arXiv preprint arXiv:2305.07336, 2023.
[8] N. Wang, C. Shi, R. Guo, H. Lu, Z. Zheng, and X. Chen, “Insmos: Instance-aware moving object segmentation in lidar data,” arXiv preprint arXiv:2303.03909, 2023.
[9] B. Mersch, X. Chen, I. Vizzo, L. Nunes, J. Behley, and C. Stachniss, “Receding moving object segmentation in 3d lidar data using sparse 4d convolutions,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7503–7510, 2022.
[10] J. Schauer and A. Nüchter, “The peopleremover—removing dynamic objects from 3-d point cloud data by traversing a voxel occupancy grid,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1679–1686, 2018.
[11] S. Pagad, D. Agarwal, S. Narayanan, K. Rangan, H. Kim, and G. Yalla, “Robust method for removing dynamic objects from point clouds,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 10765–10771, IEEE, 2020.
[12] F. Pomerleau, P. Krüsi, F. Colas, P. Furgale, and R. Siegwart, “Long-term 3d map maintenance in dynamic environments,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 3712–3719, 2014.
[13] G. Kim and A. Kim, “Remove, then revert: Static point cloud map construction using multiresolution range images,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10758–10765, IEEE, 2020.
[14] R. Ambruş, N. Bore, J. Folkesson, and P. Jensfelt, “Meta-rooms: Building and maintaining long term spatial models in a dynamic world,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1854–1861, 2014.
[15] H. Lim, S. Hwang, and H. Myung, “Erasor: Egocentric ratio of pseudo occupancy-based dynamic object removal for static 3d point cloud map building,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 2272–2279, 2021.
[16] T. Kreutz, M. Mühlhäuser, and A. S. Guinea, “Unsupervised 4d lidar moving object segmentation in stationary settings with multivariate occupancy time series,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1644–1653, 2023.
[17] J. Kim, J. Woo, and S. Im, “Rvmos: Range-view moving object segmentation leveraged by semantic and motion features,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 8044–8051, 2022.
[18] X. Li, G. Zhang, H. Pan, and Z. Wang, “Cpgnet: Cascade point-grid fusion network for real-time lidar semantic segmentation,” in 2022 International Conference on Robotics and Automation (ICRA), pp. 11117–11123, IEEE, 2022.
[19] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[20] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” 2021.
[21] R. Li, S. Li, X. Chen, T. Ma, W. Hao, J. Gall, and J. Liang, “Tfnet: Exploiting temporal cues for fast and accurate lidar semantic segmentation,” arXiv preprint arXiv:2309.07849, 2023.
[22] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, vol. 96, pp. 226–231, 1996.
[23] M. Berman, A. R. Triki, and M. B. Blaschko, “The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4413–4421, 2018.
[24] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, pp. 303–338, 2010.
[25] J. Cheng, K. Zeng, Z. Huang, X. Tang, J. Wu, C. Zhang, X. Chen, and R. Fan, “Mf-mos: A motion-focused model for moving object segmentation,” arXiv preprint arXiv:2401.17023, 2024.