Xia Text2Loc 3D Point Cloud Localization From Natural Language CVPR 2024 Paper


This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.

Except for this watermark, it is identical to the accepted version;

the final published version of the proceedings is available on IEEE Xplore.

Text2Loc: 3D Point Cloud Localization from Natural Language

Yan Xia*†1,2, Letian Shi*1, Zifeng Ding3, João F. Henriques4, Daniel Cremers1,2
1 Technical University of Munich, 2 Munich Center for Machine Learning (MCML),
3 LMU Munich, 4 Visual Geometry Group, University of Oxford
{yan.xia, letian.shi, cremers}@tum.de, [email protected], [email protected]

[Figure 1. Left: an example text query ("Hi, I am standing on the west of a green building, east of a green road, west of a black garage...") answered with "Got it!". Right: plot of Localization Recall (%) against the number of top retrievals.]
Figure 1. (Left) We introduce Text2Loc, a solution designed for city-scale position localization using textual descriptions. When provided
with a point cloud representing the surroundings and a textual query describing a position, Text2Loc determines the most probable location
of that described position within the map. (Right) Localization performance on the KITTI360Pose test set. The proposed Text2Loc achieves
consistently better performance across all top retrieval numbers. Notably, it outperforms the best baseline by up to 2 times, localizing text
queries below 5 m.

† Corresponding author. * Equal contribution.

Abstract

We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to 2× over the state-of-the-art on the KITTI360Pose dataset. Our project page is publicly available at https://yan-xia.github.io/projects/text2loc/.

1. Introduction

3D localization [18, 28] using natural language descriptions in a city-scale map is crucial for enabling autonomous agents to cooperate with humans to plan their trajectories [11] in applications such as goods delivery or vehicle pickup [36, 38]. When delivering a takeaway, couriers often encounter the “last mile problem”. Pinpointing the exact delivery spot in residential neighborhoods or large office buildings is challenging since GPS signals are bound to fail among tall buildings and vegetation [34, 37]. Couriers often rely on voice instructions over the phone from the recipient to determine this spot. More generally, the “last mile problem” occurs whenever a user attempts to navigate to an unfamiliar place. It is therefore essential to develop the capability to perform localization from the natural language, as shown in Fig. 1.

As a possible remedy, we can match linguistic descriptions to a pre-built point cloud map using calibrated depth sensors like LiDAR. Point cloud localization, which focuses on the scene’s geometry, offers several advantages over images. It remains consistent despite lighting, weather, and season changes, whereas the same geometric structure in images might appear vastly different.

The main challenge of 3D localization from natural language descriptions lies in accurately interpreting the

language and semantically understanding large-scale point clouds. To date, only a few networks have been proposed for language-based localization in a 3D large-scale city map. Text2Pos [12] is a pioneering work that aligns objects described in text with their respective instances in a point cloud, through a coarse-to-fine approach. In the coarse stage, Text2Pos first adopts a text-to-cell cross-modal retrieval method to identify the possible regions that contain the target position. In particular, Text2Pos matches the text and the corresponding submaps by the global descriptors from 3D point clouds using PointNet++ [20] and the global text descriptors using a bidirectional LSTM cell [10, 25]. This method describes a submap with its contained instances of objects, which ignores the instance relationship for both points and sentences. Recently, the authors of RET [33] noted this shortcoming and designed Relation-Enhanced Transformer networks. While this results in better global descriptors, both approaches match global descriptors using the pairwise ranking loss without considering the imbalance in positive and negative samples.

Inspired by RET [33], we also notice the importance of effectively leveraging relational dynamics among instances within submaps for geometric representation extraction. Furthermore, there is a natural hierarchy in the descriptions, composed of sentences, each with word tokens. We thus recognize the need to analyze relationships within (intra-text) and between (inter-text) descriptions. To address these challenges, we adopt a frozen pre-trained large language model T5 [23] and design a hierarchical transformer with max-pooling (HTM) that acts as an intra- and inter-text encoder, capturing the contextual details within and across sentences. Additionally, we enhance the instance encoder in Text2Pos [12] by adding a number encoder and adopting contrastive learning to maintain a balance between positive and negative pairs. Another observation is that, when refining the location prediction in the fine localization stage, the widely used text-instance matching module in previous methods should be removed, since noisy matching or inaccurate offset predictions fatally interfere with predicting the exact position of the target. To address this issue, we propose a novel matching-free fine localization network. Specifically, we first design a prototype-based map cloning (PMC) module to increase the diversity of retrieved submaps. Then, we introduce a cascaded cross-attention transformer (CCAT) to enrich the text embedding by fusing the semantic information from point clouds. These operations enable one-stage training to directly predict the target position without any text-instance matcher.

To summarize, the main contributions of this work are:
• We focus on the relatively understudied problem of point cloud localization from textual descriptions, to address the “last mile problem”.
• We propose a novel attention-based method that is hierarchical and represents contextual details within and across sentence descriptions of places.
• We study the importance of the balance between positive and negative pairs in this setting, and show how contrastive learning is an effective tool that significantly improves performance.
• We are the first to completely remove the text-instance matcher from the final localization stage. We propose a lighter and faster localization model while still achieving state-of-the-art performance via our designed prototype-based map cloning (PMC) module in training and cascaded cross-attention transformer (CCAT).
• We conduct extensive experiments on the KITTI360Pose benchmark [12] and show that the proposed Text2Loc greatly improves over the state-of-the-art methods.

2. Related work

To date, only a few networks have been proposed for natural language based 3D localization in a large-scale outdoor scene. Other tasks that are related to ours include 2D visual localization, 3D point cloud based localization, and 3D understanding with language.

2D visual localization. Visual localization in 2D images has wide-ranging applications from robotics to augmented reality. Given a query image or image sequence, the aim is to predict an accurate pose. One of the early works, Scale-Invariant Feature Transform (SIFT) [15], proposes the use of distinctive invariant features to match objects across different viewpoints, forming a basis for 2D localization. Oriented FAST and Rotated BRIEF (ORB) [24] has been pivotal in achieving robustness against scale, rotation, and illumination changes in 2D localization tasks. Recent learning-based methods [26, 29] commonly adopt a coarse-to-fine pipeline. In the coarse stage, given a query image, place recognition is performed as nearest neighbor search in high-dimensional spaces. Subsequently, a pixel-wise correspondence is established between the query and the retrieved image, facilitating precise pose prediction. However, the performance of image-based methods often degrades when facing drastic variations in illumination and appearance caused by weather and seasonal changes. Compared to feature matching in 2D visual localization, in this work, we aim to solve cross-modal localization between text and 3D point clouds.

3D point cloud based localization. With breakthroughs in learning-based image localization methods, deep learning of 3D localization has become the focus of intense research. Similar to image-based methods, a two-step pipeline is commonly used in 3D localization: (1) place recognition, followed by (2) pose estimation. PointNetVLAD [2] is a pioneering network that tackles 3D place recognition with end-to-end learning. Subsequently, SOE-Net [35] introduces the PointOE module, incorporating orientation encoding into

[Figure 2. Pipeline diagram: text descriptions (a set of hints such as "The pose is on-top of a gray road.") are passed to global place recognition (text-to-submap retrieval of the top-k submaps), then to fine localization (position estimation from the instances in the retrieved submaps), which outputs the predicted position.]

Figure 2. The proposed Text2Loc architecture. It consists of two tandem modules: Global place recognition and Fine localization. Global place recognition. Given a text-based position description, we first identify a set of coarse candidate locations, “submaps”, potentially containing the target position. This is achieved by retrieving the top-k nearest submaps from a previously constructed database of submaps using our novel text-to-submap retrieval model. Fine localization. We then refine the center coordinates of the retrieved submaps via our designed matching-free position estimation module, which adjusts the target location to increase accuracy.
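The two stages in Fig. 2 can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the embeddings are plain lists, `predict_offset` is a hypothetical stand-in for the learned fine localization network, and representing the refinement as an additive (dx, dy) offset to the submap center is an assumption made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_top_k(text_emb, submap_embs, k):
    """Coarse stage: indices of the k submaps whose embeddings lie closest to the text embedding."""
    order = sorted(range(len(submap_embs)),
                   key=lambda i: euclidean(text_emb, submap_embs[i]))
    return order[:k]

def localize(text_emb, submap_embs, submap_centers, predict_offset, k=3):
    """Fine stage: shift each retrieved submap center by a predicted (dx, dy) offset."""
    candidates = []
    for i in retrieve_top_k(text_emb, submap_embs, k):
        cx, cy = submap_centers[i]
        dx, dy = predict_offset(text_emb, i)  # hypothetical stand-in for the learned regressor
        candidates.append((cx + dx, cy + dy))
    return candidates

# Toy data: three submaps in a 2D embedding space, with planar center coordinates.
submap_embs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
submap_centers = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
text_emb = [0.9, 1.2]

top2 = retrieve_top_k(text_emb, submap_embs, k=2)
positions = localize(text_emb, submap_embs, submap_centers,
                     predict_offset=lambda t, i: (1.0, -2.0), k=2)
```

Here `top2` is `[1, 0]`, because the second submap embedding is nearest to the query, and `positions` holds one refined candidate per retrieved submap, in retrieval order.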

PointNet to generate point-wise local descriptors. Furthermore, various methods [3, 6, 8, 16, 17, 40, 41] have explored the integration of different transformer networks, specifically stacked self-attention blocks, to learn long-range contextual features. In contrast, MinkLoc3D [13] employs a voxel-based strategy to generate a compact global descriptor using a Feature Pyramid Network [14] (FPN) with generalized-mean (GeM) pooling [21]. However, the voxelization methods inevitably suffer from lost points due to the quantization step. CASSPR [37] thus introduces a dual-branch hierarchical cross attention transformer, combining the advantages of voxel-based approaches with those of point-based approaches. After getting the coarse location of the query scan, the pose estimation can be computed with point cloud registration algorithms, like the iterative closest point (ICP) [30] or autoencoder-based registration [7]. In contrast to point cloud based localization, we use natural language queries to specify any target location.

3D vision and language. Recent work has explored the cross-modal understanding of 3D vision and language. [19] bridges language implicitly to 3D visual feature representations and predicts 3D bounding boxes for target objects. Methods [1, 4, 9, 39] locate the most relevant 3D target objects in a raw point cloud scene given by the query text descriptions. However, these methods focus on real-world indoor scene localization. Text2Pos [12] is the first attempt to tackle the large city-scale outdoor scene localization task, which identifies a set of coarse locations and then refines the pose estimation. Following this, Wang et al. [33] propose a Transformer-based method to enhance representation discriminability for both point clouds and textual queries.

3. Problem statement

We begin by defining the large-scale 3D map $M_{\mathrm{ref}} = \{m_i : i = 1, \dots, M\}$ to be a collection of cubic submaps $m_i$. Each submap $m_i = \{P_{i,j} : j = 1, \dots, p\}$ includes a set of 3D object instances $P_{i,j}$. Let $T$ be a query text description consisting of a set of hints $\{\vec{h}_k\}_{k=1}^{h}$, each describing the spatial relation between the target location and an object instance. Following [12], we approach this task in a coarse-to-fine manner. The text-submap global place recognition involves the retrieval of submaps based on $T$. This stage aims to train a function $F$, which encodes both $T$ and a submap $m$ into a unified embedding space. In this space, matched query-submap pairs are brought closer together, while unmatched pairs are repelled. In fine-grained localization, we employ a matching-free network to directly regress the final position of the target based on $T$ and the retrieved submaps. Thus, the task of training a 3D localization network from natural language is defined as identifying the ground truth position $(x, y)$ (2D planar coordinates w.r.t. the scene coordinate system) from $M_{\mathrm{ref}}$:

$$\min_{\phi, F} \; \mathbb{E}_{(x,y,T)\sim\mathcal{D}} \left\Vert (x,y) - \phi\left(T, \; \operatorname*{argmin}_{m \in M_{\mathrm{ref}}} d\left(F(T), F(m)\right)\right) \right\Vert^{2} \tag{1}$$

where $d(\cdot, \cdot)$ is a distance metric (e.g. the Euclidean distance), $\mathcal{D}$ is the dataset, and $\phi$ is a neural network that is trained to output fine-grained coordinates from a text embedding $T$ and a submap $m$.

4. Methodology

Fig. 2 shows our Text2Loc architecture. Given a text-based query position description, we aim to find a set of coarse candidate submaps that potentially contain the target position by using a frozen pre-trained T5 language model [23] and an intra- and inter-text encoder with contrastive learning, described in Section 4.1. Next, we refine the location based on the retrieved submaps via a designed fine localization module, which will be explained in Section 4.2. Section 4.3 describes the training with the loss function.

4.1. Global place recognition

3D point cloud-based place recognition is usually expressed as a 3D retrieval task. Given a query LiDAR scan, the aim is to retrieve the closest match and its corresponding location from the database by matching its global descriptor against the global descriptors extracted from a database of reference scans based on their descriptor distances. Following this general approach, we adopt the text-submap cross-modal

[Figure 3. Diagram. Top: textual position descriptions pass through the frozen pre-trained T5 model and the intra- and inter-text encoder to give text descriptors T_1, ..., T_N; instances in submaps pass through the instance encoder, attention, and max-pooling to give submap descriptors S_1, ..., S_N; the two sets are compared in a matrix of pairwise products S_i · T_j. Bottom left: instance encoder (PointNet++ on the instance points (XYZ, RGB), plus color, center, and number encoders, followed by feature projection). Bottom right: intra- and inter-text encoder (multi-head transformers with max pooling).]

Figure 3. (top) The architecture of global place recognition, (bottom) instance encoder architecture for point clouds, and the architecture of intra- and inter-text encoder. Note that the pre-trained T5 model is frozen.

[Figure 4. Diagram. Text descriptions pass through the frozen pre-trained T5 model, attention, and max-pooling; instances in a submap pass through prototype-based map cloning (PMC: map variant generation, then map picking by random selection) and the instance encoder; both branches feed the cascaded cross attention blocks and an MLP that outputs the predicted position.]

Figure 4. The proposed matching-free fine localization architecture. It consists of two parallel branches: one extracts features from query text descriptions (top) and the other uses the instance encoder to extract point cloud features (bottom). Cascaded cross-attention transformers (CCAT) use queries from one branch to look up information in the other branch, aiming to fuse the semantic information from point clouds into the text embedding. The result is then processed with a simple MLP to directly estimate the target position.
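The cascade described in the Figure 4 caption can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the paper's CCAT: it uses single-head scaled dot-product attention with no learned projections, residual connections, or feed-forward layers, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row becomes a weighted mean of the value rows."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def cascaded_cross_attention(text_feats, point_feats):
    """Two cross-attention steps in sequence, the second consuming the output of the first."""
    # Step 1 (CAT1 in the paper's naming): point features are the Query, text features Key/Value.
    enhanced_points = cross_attention(point_feats, text_feats, text_feats)
    # Step 2 (CAT2): text features are the Query, the text-informed point features Key/Value.
    return cross_attention(text_feats, enhanced_points, enhanced_points)

text_feats = [[1.0, 0.0], [0.0, 1.0]]               # two hint embeddings (toy values)
point_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three instance embeddings (toy values)
fused_text = cascaded_cross_attention(text_feats, point_feats)
```

The output keeps one row per text hint, so it can be pooled and passed to a small regressor. In a trained model each step would also carry learned query, key, and value projections.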
global place recognition for coarse localization. With this stage, we aim to retrieve the nearest submap in response to a textual query. The main challenge lies in how to find simultaneously robust and distinctive global descriptors for 3D submaps $S$ and textual queries $T$. Similar to [12, 33], we employ a dual branch to encode $S$ and $T$ into a shared embedding space, as shown in Fig. 3 (top).

Text branch. We initially use a frozen pre-trained large language model, T5 [23], to extract nuanced features from textual descriptions, enhancing the embedding quality. We then design a hierarchical transformer with max-pooling layers to capture the contextual details within sentences (via self-attention) and across them (via the semantics that are shared by all sentences), as depicted in Fig. 3 (bottom right). Each transformer is a residual module comprising Multi-Head Self-Attention (MHSA) and FeedForward Network (FFN) sublayers. The feed-forward network comprises two linear layers with the ReLU activation function. More details are in the Supplementary Materials.

3D submap branch. Each instance $P_i$ in the submap $S_N$ is represented as a point cloud, containing both spatial and color (RGB) coordinates, resulting in 6D features (Fig. 3 (bottom left)). We utilize PointNet++ [20] (which can be replaced with a more powerful encoder) to extract semantic features from the points. Additionally, we obtain a color embedding by encoding RGB coordinates with our color encoder and a positional embedding by encoding the instance center $\bar{P}_i$ (i.e., the mean coordinates) with our positional encoder. We find that object categories consistently differ in point counts; for example, roads (typically > 1000 points) have a higher point count than poles (< 500 points). We thus design a number encoder, providing potential class-specific prior information by explicitly encoding the point numbers. All the color, positional, and number encoders are 3-layer multi-layer perceptrons (MLPs) with output dimensions matching the semantic point embedding dimension. We merge the semantic, color, positional, and quantity embeddings through concatenation and process them with a projection layer, another 3-layer MLP. This projection layer produces the final instance embedding $F_{p_i}$. Finally, we aggregate in-submap instance descriptors $\{F_{p_i}\}_{i=1}^{N_p}$ into a global submap descriptor $F^S$ using an attention layer [35] followed by a max pooling operation.

Text-submap contrastive learning. We introduce a cross-modal contrastive learning objective to address the limitations of the widely used pairwise ranking loss in [12, 33]. This objective aims to pull the feature centroids of 3D submaps and their corresponding text prompts closer together. In our overall architecture, illustrated in Figure 3, we incorporate both a text encoder and a point cloud encoder. These encoders embed the text-submap pairs into text features denoted as $F^T \in \mathbb{R}^{1\times C}$ and 3D submap features represented as $F^S \in \mathbb{R}^{1\times C}$, respectively. Here, $C$ signifies the embedding dimension. Inspired by CLIP [22], we compute the feature distance between language descriptions and 3D submaps with a contrastive learning loss (see Sec. 4.3 for details).

4.2. Fine localization

Following the text-submap global place recognition, we aim to refine the target location prediction within the retrieved submaps in fine localization. Although the final localization network in previous methods [12, 33] achieved notable success using a text-submap matching strategy, the inherent ambiguity in the text descriptions significantly impeded accurate offset predictions for individual object instances. To address this issue, we propose a novel matching-free fine localization network, as shown in Fig. 4. The text branch (top) captures the fine-grained features by using a frozen pre-trained language model T5 [23] and an attention unit followed by a max pooling layer. The submap branch (bottom) applies a prototype-based map cloning module to generate more map variants and then extracts the point cloud features using an instance encoder, the same as in the global place recognition. We then fuse the text-submap feature with a Cascaded Cross-Attention Transformer and finally regress the target position via a simple MLP.

Cascaded Cross-attention Transformer (CCAT). To efficiently exploit the relationship between the text branch and the 3D submap branch, we propose a CCAT to fuse the features from the two branches. The CCAT consists of two Cross Attention Transformers (CAT), each the same as in [37]. CAT1 takes the point cloud features as Query and the text features as Key and Value. It extracts text features with reference to the point features and outputs point feature maps that are informed by the text features. Conversely, CAT2 produces enhanced text features by taking the text features as the Query and the enhanced point cloud features from CAT1 as the Key and Value. Notably, CAT1 and CAT2 form a cascading structure, which is the main difference from the HCAT in [37]. In this work, two cascaded CCATs are used. More ablation studies and analyses are in the Supplementary Materials.

Prototype-based Map Cloning (PMC). To produce more effective submap variants for training, we propose a novel prototype-based map cloning module. For each pair $\{T_i, S_i\}$, we hope to generate a collection $\mathcal{G}_i$ of surrounding map variants centered on the current map $S_i$, which can be formulated as follows:

$$\mathcal{G}_i = \left\{ S_j \;\middle|\; \left\Vert \bar{s}_j - \bar{s}_i \right\Vert_{\infty} < \alpha, \; \left\Vert \bar{s}_j - c_i \right\Vert_{\infty} < \beta \right\}, \tag{2}$$

where $\bar{s}_i$, $\bar{s}_j$ are the center coordinates of the submaps $S_i$ and $S_j$ respectively, $c_i$ represents the ground-truth target position described by $T_i$, and $\alpha$ and $\beta$ are pre-defined thresholds. In this work, we set $\alpha = 15$ and $\beta = 12$.

In practice, we find that certain submaps in $\mathcal{G}_i$ have an insufficient number of object instances corresponding to the textual descriptions $T_i$. To address this, we introduce a filtering process by setting a minimum threshold $N_m = 1$. This threshold implies that at most one instance mismatch is permissible. After applying this filter, we randomly select a single submap from the refined $\mathcal{G}_i$ for training.

4.3. Loss function

Global place recognition. Different from the pairwise ranking loss widely used in previous methods [12, 33], we train the proposed method for text-submap retrieval with a cross-modal contrastive learning objective. Given an input batch of 3D submap descriptors $\{F_i^S\}_{i=1}^{N}$ and matching text descriptors $\{F_i^T\}_{i=1}^{N}$, where $N$ is the batch size, the contrastive loss among each pair is computed as follows,

$$l(i,T,S) = -\log \frac{\exp(F^T_i \cdot F^S_i / \tau)}{\sum_{j\in N} \exp(F^T_i \cdot F^S_j / \tau)} - \log \frac{\exp(F^S_i \cdot F^T_i / \tau)}{\sum_{j\in N} \exp(F^S_i \cdot F^T_j / \tau)}, \tag{3}$$

where $\tau$ is the temperature coefficient, similar to CLIP [22]. Within a training mini-batch, the text-submap alignment objective $L(T, S)$ can be described as:

$$L(T,S) = \frac{1}{N}\left[ \sum_{i\in N} l(i, T, S) \right]. \tag{4}$$

Fine localization. Unlike previous methods [12, 33], our fine localization network does not include a text-instance matching module, making our training more straightforward and faster. Note that this model is trained separately from the global place recognition. Here, our goal is to minimize the distance between the predicted location of the target and the ground truth. In this paper, we use only the mean squared error loss $L_r$ to train the translation regressor.

$$L(C_{gt}, C_{pred}) = \left\Vert C_{gt} - C_{pred} \right\Vert_{2}, \tag{5}$$

where $C_{pred} = (x, y)$ (see Eq. (1)) is the predicted target coordinates, and $C_{gt}$ is the ground-truth coordinates.

5. Experiments

5.1. Benchmark Dataset

We train and evaluate the proposed Text2Loc on the KITTI360Pose benchmark presented in [12]. It includes point clouds of 9 districts, covering 43,381 position-query pairs with a total area of 15.51 km². Following [12], we choose five scenes (11.59 km²) for training, one for validation, and the remaining three (2.14 km²) for testing. The 3D submap is a cube that is 30 m long with a stride of 10 m. This creates a database with 11,259/1,434/4,308 submaps for training/validation/testing scenes and a total of 17,001 submaps for the entire dataset. For more details, please refer to the supplementary material in [12].

5.2. Evaluation criteria

Following [12], we use Retrieval Recall at Top k (k ∈ {1, 3, 5}) to evaluate text-submap global place recognition. For assessing localization performance, we evaluate with respect to the top k retrieved candidates (k ∈ {1, 5, 10}) and report localization recall. Localization recall measures the proportion of queries that are successfully localized, i.e. whose error falls below a given error threshold, by default ϵ < 5/10/15 m.

5.3. Results

5.3.1 Global place recognition

We compare our Text2Loc with the state-of-the-art methods: Text2Pos [12] and RET [33]. We evaluate global place
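The prototype-based map cloning selection in Eq. (2) of Section 4.2, followed by the filtering and random pick, can be sketched as below. This is a minimal sketch under stated assumptions: submap centers and the target are 2D tuples, `matched_hints[j]` is a hypothetical count of how many hints find their described instance in submap j, and reading the threshold as "at most `max_mismatch` unmatched hints" is our interpretation of the filtering rule.

```python
import random

def linf(a, b):
    """L-infinity (Chebyshev) distance between two points."""
    return max(abs(x - y) for x, y in zip(a, b))

def pmc_candidates(centers, i, target, alpha=15.0, beta=12.0):
    """Indices j with ||s_j - s_i||_inf < alpha and ||s_j - c_i||_inf < beta, as in Eq. (2)."""
    anchor = centers[i]
    return [j for j, s in enumerate(centers)
            if linf(s, anchor) < alpha and linf(s, target) < beta]

def pick_variant(candidates, matched_hints, num_hints, max_mismatch=1, rng=random):
    """Drop candidates that miss too many hinted instances, then pick one at random."""
    kept = [j for j in candidates if num_hints - matched_hints[j] <= max_mismatch]
    return rng.choice(kept) if kept else None

centers = [(0, 0), (10, 0), (20, 0), (10, 10), (30, 30)]   # toy submap centers
candidates = pmc_candidates(centers, i=1, target=(8, 3))    # submap 1 is the prototype
matched_hints = [6, 6, 6, 4, 6]                             # hypothetical per-submap match counts
variant = pick_variant(candidates, matched_hints, num_hints=6, rng=random.Random(0))
```

With these toy values `candidates` is `[0, 1, 3]`: submap 2 lies exactly 12 from the target along x and fails the strict `< beta` test, and submap 4 is too far from both. Submap 3 is then filtered out because two of its six hints are unmatched, so `variant` is drawn from submaps 0 and 1.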

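The symmetric text-submap contrastive objective of Eqs. (3) and (4) in Section 4.3 can be written out directly. The sketch below is an illustration, not the training code: descriptors are plain lists, the temperature value is an arbitrary assumption, and no feature normalization is applied because the equations do not specify one.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def log_softmax_at(scores, idx):
    """log(exp(scores[idx]) / sum_j exp(scores[j])), computed stably."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[idx] - log_sum

def contrastive_loss(text_feats, submap_feats, tau=0.1):
    """Mean over the batch of the two-directional loss l(i, T, S) from Eq. (3)."""
    n = len(text_feats)
    total = 0.0
    for i in range(n):
        text_to_submap = [dot(text_feats[i], submap_feats[j]) / tau for j in range(n)]
        submap_to_text = [dot(submap_feats[i], text_feats[j]) / tau for j in range(n)]
        total += -log_softmax_at(text_to_submap, i) - log_softmax_at(submap_to_text, i)
    return total / n  # Eq. (4): average over the mini-batch

text = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(text, [[1.0, 0.0], [0.0, 1.0]], tau=1.0)   # matching pairs agree
swapped = contrastive_loss(text, [[0.0, 1.0], [1.0, 0.0]], tau=1.0)   # matching pairs disagree
```

The loss is lower when each text descriptor is most similar to its own submap (`aligned`) than when the pairing is swapped (`swapped`). Every other item in the batch acts as a negative for item i, which is how the objective differs from a pairwise ranking loss with a single sampled negative.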
Localization Recall (ϵ < 5/10/15 m) ↑

| Methods | Validation Set, k = 1 | Validation Set, k = 5 | Validation Set, k = 10 | Test Set, k = 1 | Test Set, k = 5 | Test Set, k = 10 |
|---|---|---|---|---|---|---|
| Text2Pos [12] | 0.14/0.25/0.31 | 0.36/0.55/0.61 | 0.48/0.68/0.74 | 0.13/0.21/0.25 | 0.33/0.48/0.52 | 0.43/0.61/0.65 |
| RET [33] | 0.19/0.30/0.37 | 0.44/0.62/0.67 | 0.52/0.72/0.78 | 0.16/0.25/0.29 | 0.35/0.51/0.56 | 0.46/0.65/0.71 |
| Text2Loc (Ours) | 0.37/0.57/0.63 | 0.68/0.85/0.87 | 0.77/0.91/0.93 | 0.33/0.48/0.52 | 0.61/0.75/0.78 | 0.71/0.84/0.86 |

Table 1. Performance comparison on the KITTI360Pose benchmark [12].
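The localization recall reported in Table 1 can be computed as sketched below. The sketch assumes one common reading of the metric from Section 5.2: a query counts as localized if any of its first k candidate positions lies within the error threshold of the ground truth. The function and variable names are hypothetical.

```python
import math

def localization_recall(predictions, ground_truth, k, threshold):
    """Fraction of queries with at least one of the first k candidates closer than `threshold`."""
    hits = 0
    for candidates, (gx, gy) in zip(predictions, ground_truth):
        if any(math.hypot(px - gx, py - gy) < threshold for px, py in candidates[:k]):
            hits += 1
    return hits / len(ground_truth)

# Toy example: three queries, each with two ranked candidate positions (meters).
predictions = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(50.0, 50.0), (3.0, 3.0)],
    [(100.0, 100.0), (90.0, 90.0)],
]
ground_truth = [(1.0, 1.0), (0.0, 0.0), (0.0, 0.0)]

recall_top1 = localization_recall(predictions, ground_truth, k=1, threshold=5.0)
recall_top2 = localization_recall(predictions, ground_truth, k=2, threshold=5.0)
```

Only the first query is within 5 m at k = 1, so `recall_top1` is 1/3. The second query succeeds once its second candidate is considered, giving 2/3 for `recall_top2`. Each cell in Table 1 corresponds to three such values, one per threshold.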

Submap Retrieval Recall ↑

| Methods | Validation Set, k = 1 | Validation Set, k = 3 | Validation Set, k = 5 | Test Set, k = 1 | Test Set, k = 3 | Test Set, k = 5 |
|---|---|---|---|---|---|---|
| Text2Pos [12] | 0.14 | 0.28 | 0.37 | 0.12 | 0.25 | 0.33 |
| RET [33] | 0.18 | 0.34 | 0.44 | - | - | - |
| Text2Loc (Ours) | 0.32 | 0.56 | 0.67 | 0.28 | 0.49 | 0.58 |

Table 2. Performance comparison for global place recognition on the KITTI360Pose benchmark [12]. Note that only values that are available in RET [33] are reported.

recognition performance on the KITTI360Pose validation and test set for a fair comparison. Table 2 shows the top-1/3/5 recall of each method. The best performance on the validation set reaches a recall of 0.32 at top-1. Notably, this outperforms the recall achieved by the current state-of-the-art method RET by a wide margin of 78%. Furthermore, Text2Loc achieves recall rates of 0.56 and 0.67 at top-3 and top-5, respectively, representing substantial improvements of 65% and 52% relative to the performance of RET. These improvements are also observed in the test set, indicating the superiority of the method over baseline approaches. Note that we report only the values available in the original publication of RET. These improvements demonstrate the efficacy of our proposed Text2Loc in capturing cross-modal local information and generating more discriminative global descriptors. More qualitative results are given in Section 6.2.

5.3.2 Fine localization

To improve the localization accuracy of the network, [12, 33] further introduce fine localization. To make the comparisons fair, we follow the same setting in [12, 33] to train our fine localization network. As illustrated in Table 1, we report the top-k (k = 1/5/10) recall rate at different error thresholds ϵ < 5/10/15 m for comparison. Text2Loc achieves a top-1 recall rate of 0.37 on the validation set and 0.33 on the test set under error bound ϵ < 5 m, which are 95% higher than and about 2× those of the previous state-of-the-art RET, respectively. Furthermore, our Text2Loc performs consistently better when relaxing the localization error constraints or increasing k. This demonstrates that Text2Loc can accurately interpret the text descriptions and semantically understand point clouds better than the previous state-of-the-art methods. We also show some qualitative results in Section 6.2.

6. Performance analysis

6.1. Ablation study

The following ablation studies evaluate the effectiveness of different components of Text2Loc, including both the text-submap global place recognition and fine localization.

Global place recognition. To assess the relative contribution of each module, we remove the frozen pre-trained large language model T5, the hierarchical transformer with max-pooling (HTM) module in the text branch, and the number encoder in the 3D submap branch from our network one by one. We also analyze the performance of the proposed text-submap contrastive learning. All networks are trained on the KITTI360Pose dataset, with results shown in Table 3. Utilizing the frozen pre-trained LLM T5, we observe an approximate 8% increase in retrieval accuracy at top 1 on the test set. While the HTM notably enhances performance on the validation set, it shows marginal improvements on the test set. Additionally, integrating the number encoder leads to a significant 6% improvement in the recall metric at top 1 on the validation set. Notably, the performance on the validation/test set reaches 0.32/0.28 recall at top 1, exceeding the same model trained with the pairwise ranking loss by 52% and 40%, respectively, highlighting the superiority of the proposed contrastive learning approach.

Fine localization. To analyze the effectiveness of each proposed module in our matching-free fine-grained localization, we separately evaluate the Cascaded Cross-Attention Transformer (CCAT) and Prototype-based Map Cloning (PMC) module, denoted as Text2Loc CCAT and Text2Loc PMC. For a fair comparison, all methods utilize the same submaps retrieved from our global place recognition. The results are shown in Table 4. Text2Pos* significantly outperforms the original results of Text2Pos [12], indicating the superiority of our proposed global place recognition. Notably, replacing the matcher in Text2Pos [12] with our CCAT results in about 7% improvement at top 1 on the test set. We also observe the inferior performance of Text2Loc PMC to the proposed method when integrating only the proposed PMC module into the Text2Pos [12] fine

Submap Retrieval Recall ↑
Methods        Validation Set            Test Set
               k=1    k=3    k=5         k=1    k=3    k=5
w/o T5         0.29   0.53   0.65        0.26   0.45   0.54
w/o HTM        0.30   0.54   0.65        0.28   0.48   0.57
w/o CL         0.21   0.42   0.53        0.20   0.36   0.45
w/o NE         0.30   0.52   0.63        0.27   0.47   0.56
Full (Ours)    0.32   0.56   0.67        0.28   0.49   0.58

Table 3. Ablation study of the global place recognition on the KITTI360Pose benchmark. "w/o T5" indicates replacing the frozen pre-trained T5 model with the LSTM in [12]. "w/o HTM" indicates removing the proposed hierarchical transformer with max-pooling (HTM). "w/o CL" indicates replacing the proposed contrastive learning with the widely used pairwise ranking loss. "w/o NE" indicates removing the number encoder from the instance encoder of the 3D submap branch.

Localization Recall (ϵ < 5 m) ↑
Methods            Validation Set            Test Set
                   k=1    k=5    k=10        k=1    k=5    k=10
Text2Pos [12]      0.14   0.36   0.48        0.13   0.33   0.43
Text2Pos*          0.33   0.65   0.75        0.30   0.58   0.67
Text2Loc CCAT      0.32   0.64   0.74        0.32   0.60   0.70
Text2Loc PMC       0.32   0.64   0.74        0.29   0.56   0.66
Text2Loc (Ours)    0.37   0.68   0.77        0.33   0.61   0.71

Table 4. Ablation study of the fine localization on the KITTI360Pose benchmark. * indicates the fine localization network from Text2Pos [12] applied to the submaps retrieved through our global place recognition. Text2Loc CCAT indicates the removal of only the PMC while retaining the CCAT in our network. Conversely, Text2Loc PMC keeps the PMC but replaces the CCAT with the text-instance matcher in Text2Pos.
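The recall values reported in Tables 3 and 4 follow the usual top-k definition: a query counts as a hit if any of its k best candidates is correct. The following is a minimal Python sketch of both metrics under that reading. The function names and the input layout (one ranked candidate list per query) are assumptions made for illustration and are not taken from the authors' evaluation code.

```python
import math


def submap_retrieval_recall(ranked_ids, positive_ids, k):
    """Fraction of queries whose top-k retrieved submaps include a positive one.

    ranked_ids:   list (one entry per query) of submap ids, best match first.
    positive_ids: list (one entry per query) of sets of positive submap ids.
    """
    hits = 0
    for ranking, positives in zip(ranked_ids, positive_ids):
        if any(submap_id in positives for submap_id in ranking[:k]):
            hits += 1
    return hits / len(ranked_ids)


def localization_recall(pred_positions, gt_positions, k, eps):
    """Fraction of queries with a top-k predicted position closer than eps meters.

    pred_positions: list (one entry per query) of (x, y) predictions, one per
                    retrieved submap, ordered by retrieval rank.
    gt_positions:   list of ground-truth (x, y) positions.
    """
    hits = 0
    for preds, gt in zip(pred_positions, gt_positions):
        if any(math.dist(p, gt) < eps for p in preds[:k]):
            hits += 1
    return hits / len(gt_positions)


# Toy example with two queries (all values are made up).
ranked = [[3, 1, 2], [5, 4, 6]]
positives = [{1}, {9}]
preds = [[(0.0, 0.0), (10.0, 0.0)], [(100.0, 100.0)]]
gts = [(12.0, 0.0), (0.0, 0.0)]

print(submap_retrieval_recall(ranked, positives, k=3))
print(localization_recall(preds, gts, k=2, eps=5.0))
```

Sweeping k over 1/5/10 and eps over 5/10/15 m with the second function yields a grid of values of the kind reported for localization recall.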
localization network. The results are consistent with our expectations since PMC can lead to the loss of object instances in certain submaps (see Supp.). Combining both modules achieves the best performance, improving the performance by 10% at top-1 on the test set. This demonstrates that adding more training submaps by PMC is beneficial for our matching-free strategy without any text-instance matches.

6.2. Qualitative analysis
In addition to quantitative results, we show qualitative results of two correct point cloud localizations from text descriptions and one failure case in Fig. 5. Given a query text description, we visualize the ground truth, top-3 retrieved submaps, and our fine localization results. In text-submap global place recognition, a retrieved submap is defined as positive if it contains the target location. Text2Loc excels in retrieving the ground truth submap or nearby submaps in most cases. However, there are instances where negative submaps are retrieved, as observed in (b) at top-3. Text2Loc showcases its ability to predict more accurate locations based on positively retrieved submaps in fine localization. We also present one failure case in (c), where all retrieved submaps are negative. In these scenarios, our fine localization struggles to predict accurate locations, highlighting its reliance on the coarse localization stage. An additional observation is that despite their distance from the target location, all these negative submaps contain instances similar to the ground truth. These observations show the challenge posed by the low diversity of outdoor scenes, emphasizing the need for highly discriminative representations to effectively disambiguate between submaps.

6.3. Computational cost analysis
In this section, we analyze the required computational resources of our coarse and matching-free fine localization network regarding the number of parameters and time efficiency. For a fair comparison, all methods are tested on the KITTI360Pose test set with a single NVIDIA TITAN X (12G) GPU. Text2Loc takes 22.75 ms and 12.37 ms to obtain a global descriptor for a textual query and a submap respectively, while Text2Pos [12] achieves it in 2.31 ms and 11.87 ms. Text2Loc has more running time for the text query due to the extra frozen T5 (21.18 ms) and HTM module (1.57 ms). Our text and 3D networks have 13.65 M (without T5) and 1.84 M parameters respectively. For fine localization, we replace the proposed matching-free CCAT module with the text-instance matcher in [12, 33], denoted as Text2Loc Matcher. From Table 5, we observe that Text2Loc is nearly two times more parameter-efficient than the baselines [12, 33] and uses only about 5% of their inference time. The main reason is that the previous methods adopt SuperGlue [27] as a matcher, which results in a heavy and time-consuming process. Besides, our matching-free architecture spares us from running the Sinkhorn algorithm [5]. These improvements significantly enhance the network's efficiency without compromising its performance.

Methods             Parameters (M)   Runtime (ms)   Localization Recall
Text2Loc Matcher    2.08             43.11          0.30
Text2Loc (Ours)     1.06             2.27           0.33

Table 5. Computational cost requirement analysis of our fine localization network on the KITTI360Pose test dataset.

6.4. Robustness analysis
In this section, we analyze the effect of text changes on localization accuracy. For a clear demonstration, we only change one sentence in the query text descriptions, denoted as Text2Loc modified. All networks are evaluated on the KITTI360Pose test set, with results shown in Table 6. Text2Loc modified only achieves a recall of 0.15 at top-1 retrieval, indicating that our text-submap place recognition network is very sensitive to the text embedding. We also observe the inferior performance of Text2Loc modified in the

[Figure 5 panel columns: text descriptions; ground truth; global place recognition (top 1, top 2, top 3); fine localization. Query texts of the three examples:]
(a) The pose is on top of a gray road. The pose is east of a beige sidewalk. The pose is south of a beige wall. The pose is west of a black fence. The pose is west of black vegetation. The pose is north of a black terrain.
(b) The pose is on top of a gray-green road. The pose is north of a gray sidewalk. The pose is on top of a gray parking. The pose is east of gray vegetation. The pose is north of a gray-green terrain. The pose is east of a dark-green pole.
(c) The pose is east of a gray road. The pose is on top of a gray sidewalk. The pose is on top of black vegetation. The pose is south of a dark-green terrain. The pose is south of a dark-green pole. The pose is south of a dark-green traffic sign.

Figure 5. Qualitative localization results on the KITTI360Pose dataset: In global place recognition, the numbers in the top-3 retrieved submaps represent center distances between retrieved submaps and the ground truth. Green boxes indicate positive submaps containing the target location, while red boxes signify negative submaps. For fine localization, red and black dots represent the ground truth and predicted target locations, with the red number indicating the distance between them.
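The positive/negative labeling and the center distances shown in Figure 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: submaps are modeled as square, axis-aligned cells, and the `Submap` class, its fields, the 30 m side length, and `label_retrievals` are hypothetical names and values rather than part of the released code.

```python
import math
from dataclasses import dataclass


@dataclass
class Submap:
    """Hypothetical square, axis-aligned submap given by its center and side length (m)."""
    cx: float
    cy: float
    size: float

    def contains(self, x: float, y: float) -> bool:
        half = self.size / 2.0
        return abs(x - self.cx) <= half and abs(y - self.cy) <= half


def label_retrievals(retrieved, gt_submap, target):
    """Return (is_positive, center_distance) for every retrieved submap.

    A submap is positive if it contains the target location; the distance is
    measured between the retrieved center and the ground-truth submap center.
    """
    results = []
    for sm in retrieved:
        is_positive = sm.contains(*target)
        center_dist = math.hypot(sm.cx - gt_submap.cx, sm.cy - gt_submap.cy)
        results.append((is_positive, center_dist))
    return results


# Toy example: the side length of 30 m is an assumed value for illustration.
gt = Submap(0.0, 0.0, 30.0)
target = (4.0, -3.0)
retrieved = [Submap(0.0, 0.0, 30.0), Submap(10.0, 0.0, 30.0), Submap(60.0, 80.0, 30.0)]
labels = label_retrievals(retrieved, gt, target)
for rank, (positive, dist) in enumerate(labels, start=1):
    print(f"top {rank}: positive={positive}, center distance={dist:.1f} m")
```

With overlapping submaps, more than one retrieved candidate can be positive, as in the second entry of the toy example.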
fine localization. More qualitative results are in the Supplementary Materials.

Test set
Methods              Submap Retrieval Recall     Localization Recall, k = 5
                     k=1    k=3    k=5           ϵ < 5 m   ϵ < 10 m   ϵ < 15 m
Text2Loc modified    0.15   0.30   0.38          0.39      0.54       0.58
Text2Loc (Ours)      0.28   0.49   0.58          0.53      0.68       0.71

Table 6. Performance comparisons of changing one sentence in the queries on the KITTI360Pose test set.
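To put the robustness numbers in perspective, the relative loss in submap retrieval recall caused by changing one sentence can be worked out directly from the values reported in Table 6. The short Python sketch below does this arithmetic; the helper name `relative_drop` is introduced here for illustration only.

```python
def relative_drop(baseline: float, modified: float) -> float:
    """Relative decrease of `modified` with respect to `baseline`."""
    return (baseline - modified) / baseline


# Submap retrieval recall on the test set, copied from Table 6.
ks = ["k=1", "k=3", "k=5"]
baseline = [0.28, 0.49, 0.58]   # Text2Loc (Ours)
modified = [0.15, 0.30, 0.38]   # Text2Loc modified

drops = {k: relative_drop(b, m) for k, b, m in zip(ks, baseline, modified)}
for k, d in drops.items():
    print(f"{k}: recall drops by {d:.1%}")
```

The top-1 recall falls by a little under half, and the relative drop shrinks as k grows, which is consistent with the sensitivity to the text embedding noted above.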
6.5. Embedding space analysis
We employ T-SNE [31] to visually represent the learned embedding space, as illustrated in Figure 6. The baseline method Text2Pos [12] yields a less discriminative space, with positive submaps often distant from the query text descriptions and even scattered across the embedding space. In contrast, our method brings positive submaps and query text representations significantly closer together in the embedding space. This shows that the proposed network indeed results in a more discriminative cross-modal space for recognizing places.

[Figure 6 panels: Text2Pos, Ours. Legend: text descriptions, positive submap, negative submaps.]
Figure 6. T-SNE visualization for the global place recognition.

7. Conclusion
We proposed Text2Loc for 3D point cloud localization based on a few natural language descriptions. In global place recognition, we capture the contextual details within and across text sentences with a novel attention-based method and introduce contrastive learning for the text-submap retrieval task. In addition, we are the first to propose a matching-free fine localization network for this task, which is lighter, faster, and more accurate. Extensive experiments demonstrate that Text2Loc improves the localization performance over the state-of-the-art by a large margin. Future work will explore trajectory planning in real robots.

Acknowledgements. This work was supported by the ERC Advanced Grant SIMULACRON, by the Munich Center for Machine Learning, and by the Royal Academy of Engineering (RF\201819\18\163).

References

[1] Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 422–440. Springer, 2020.
[2] Mikaela Angelina Uy and Gim Hee Lee. Pointnetvlad: Deep point cloud based retrieval for large-scale place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4470–4479, 2018.
[3] Tiago Barros, Luís Garrote, Ricardo Pereira, Cristiano Premebida, and Urbano J Nunes. Attdlnet: Attention-based deep network for 3d lidar place recognition. In Iberian Robotics conference, pages 309–320. Springer, 2022.
[4] Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In European conference on computer vision, pages 202–221. Springer, 2020.
[5] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2013.
[6] Haowen Deng, Tolga Birdal, and Slobodan Ilic. Ppfnet: Global context aware local features for robust 3d point matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 195–205, 2018.
[7] Gil Elbaz, Tamar Avraham, and Anath Fischer. 3d point cloud registration for localization using a deep neural network auto-encoder. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4631–4640, 2017.
[8] Zhaoxin Fan, Zhenbo Song, Hongyan Liu, Zhiwu Lu, Jun He, and Xiaoyong Du. Svt-net: Super light-weight sparse voxel transformer for large scale place recognition. AAAI, 2022.
[9] Mingtao Feng, Zhen Li, Qi Li, Liang Zhang, XiangDong Zhang, Guangming Zhu, Hui Zhang, Yaonan Wang, and Ajmal Mian. Free-form description guided 3d visual graph network for object grounding in point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3722–3731, 2021.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[11] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[12] Manuel Kolmet, Qunjie Zhou, Aljoša Ošep, and Laura Leal-Taixé. Text2pos: Text-to-point-cloud cross-modal localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6687–6696, 2022.
[13] Jacek Komorowski. Minkloc3d: Point cloud based large-scale place recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1790–1799, 2021.
[14] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[15] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004.
[16] Junyi Ma, Jun Zhang, Jintao Xu, Rui Ai, Weihao Gu, and Xieyuanli Chen. Overlaptransformer: An efficient and yaw-angle-invariant transformer network for lidar-based place recognition. IEEE Robotics and Automation Letters, 7(3):6958–6965, 2022.
[17] Junyi Ma, Guangming Xiong, Jingyi Xu, and Xieyuanli Chen. Cvtnet: A cross-view transformer network for place recognition using lidar data. arXiv preprint arXiv:2302.01665, 2023.
[18] Zhixiang Min, Bingbing Zhuang, Samuel Schulter, Buyu Liu, Enrique Dunn, and Manmohan Chandraker. Neurocs: Neural nocs supervision for monocular 3d object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21404–21414, 2023.
[19] Mihir Prabhudesai, Hsiao-Yu Fish Tung, Syed Ashar Javed, Maximilian Sieb, Adam W Harley, and Katerina Fragkiadaki. Embodied language grounding with implicit 3d visual feature representations. 2019.
[20] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
[21] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning cnn image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7):1655–1668, 2018.
[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[23] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[24] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In 2011 International conference on computer vision, pages 2564–2571. IEEE, 2011.
[25] Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. CoRR, abs/1402.1128, 2014.
[26] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12716–12725, 2019.
[27] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In CVPR, 2020.
[28] Paul-Edouard Sarlin, Daniel DeTone, Tsun-Yi Yang, Armen Avetisyan, Julian Straub, Tomasz Malisiewicz, Samuel Rota Bulò, Richard Newcombe, Peter Kontschieder, and Vasileios Balntas. Orienternet: Visual localization in 2d public maps with neural matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21632–21642, 2023.
[29] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE transactions on pattern analysis and machine intelligence, 39(9):1744–1756, 2016.
[30] Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun. Generalized-icp. In Robotics: science and systems, page 435. Seattle, WA, 2009.
[31] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 2008.
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[33] Guangzhi Wang, Hehe Fan, and Mohan Kankanhalli. Text to point cloud localization with relation-enhanced transformer. arXiv preprint arXiv:2301.05372, 2023.
[34] Yan Xia. Perception of vehicles and place recognition in urban environment based on MLS point clouds. PhD thesis, Technische Universität München, 2023.
[35] Yan Xia, Yusheng Xu, Shuang Li, Rui Wang, Juan Du, Daniel Cremers, and Uwe Stilla. Soe-net: A self-attention and orientation encoding network for point cloud based place recognition. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11348–11357, 2021.
[36] Yan Xia, Yusheng Xu, Cheng Wang, and Uwe Stilla. Vpc-net: Completion of 3d vehicles from mls point clouds. ISPRS Journal of Photogrammetry and Remote Sensing, 174:166–181, 2021.
[37] Yan Xia, Mariia Gladkova, Rui Wang, Qianyun Li, Uwe Stilla, João F Henriques, and Daniel Cremers. Casspr: Cross attention single scan place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8461–8472, 2023.
[38] Yan Xia, Qiangqiang Wu, Wei Li, Antoni B Chan, and Uwe Stilla. A lightweight and detector-free 3d single object tracker on point clouds. IEEE Transactions on Intelligent Transportation Systems, 2023.
[39] Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Sheng Wang, Zhen Li, and Shuguang Cui. Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1791–1800, 2021.
[40] Wenxiao Zhang, Huajian Zhou, Zhen Dong, Qingan Yan, and Chunxia Xiao. Rank-pointretrieval: Reranking point cloud retrieval via a visually consistent registration evaluation. IEEE Transactions on Visualization and Computer Graphics, 2022.
[41] Zhicheng Zhou, Cheng Zhao, Daniel Adolfsson, Songzhi Su, Yang Gao, Tom Duckett, and Li Sun. Ndt-transformer: Large-scale 3d point cloud localisation using the normal distribution transform representation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 5654–5660. IEEE, 2021.