0% found this document useful (0 votes)
25 views12 pages

Multi-Modal, Multi-Granularity Path Representation Learning - Extended Version.18428v1

Uploaded by

neturiue
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views12 pages

Multi-Modal, Multi-Granularity Path Representation Learning - Extended Version.18428v1

Uploaded by

neturiue
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

MM-Path: Multi-modal, Multi-granularity Path Representation

Learning—Extended Version
Ronghui Xu1 , Hanyin Cheng1 , Chenjuan Guo1 , Hongfan Gao1 , Jilin Hu1 , Sean Bin Yang2 , Bin Yang1 *
1 East
China Normal University, 2 Chongqing University of Posts and Telecommunications
{rhxu,hycheng,hf.gao}@stu.ecnu.edu.cn,{cjguo,jlhu,byang}@dase.ecnu.edu.cn,[email protected]
Abstract Road path
(a) (b)
Developing effective path representations has become increasingly
𝑣4 𝑣5 𝑣7 𝑣8
essential across various fields within intelligent transportation. Al- 𝑣2 𝑣3
𝑣6
arXiv:2411.18428v1 [cs.LG] 27 Nov 2024

though pre-trained path representation learning models have shown Sub-road 𝑣1


Intersection
improved performance, they predominantly focus on the topological
(c)
structures from single modality data, i.e., road networks, overlooking Road segment End
the geometric and contextual features associated with path-related
images, e.g., remote sensing images. Similar to human understand- Start

ing, integrating information from multiple modalities can provide a


An example of a path Image path
more comprehensive view, enhancing both representation accuracy
and generalization. However, variations in information granularity
impede the semantic alignment of road network-based paths (road Figure 1: A path in different modalities
paths) and image-based paths (image paths), while the heterogeneity
of multi-modal data poses substantial challenges for effective fu- Keywords
sion and utilization. In this paper, we propose a novel Multi-modal,
Path representation learning, Multi-modal learning, Self-supervised
Multi-granularity Path Representation Learning Framework (MM-
learning
Path), which can learn a generic path representation by integrating
modalities from both road paths and image paths. To enhance the ACM Reference Format:
alignment of multi-modal data, we develop a multi-granularity align- Ronghui Xu, Hanyin Cheng, Chenjuan Guo, Hongfan Gao, Jilin Hu, Sean
Bin Yang, Bin Yang . 2018. MM-Path: Multi-modal, Multi-granularity Path
ment strategy that systematically associates nodes, road sub-paths,
Representation Learning—Extended Version. In Proceedings of Make sure
and road paths with their corresponding image patches, ensuring
to enter the correct conference title from your rights confirmation emai
the synchronization of both detailed local information and broader (Conference acronym ’XX). ACM, New York, NY, USA, 12 pages. https:
global contexts. To address the heterogeneity of multi-modal data //doi.org/XXXXXXX.XXXXXXX
effectively, we introduce a graph-based cross-modal residual fusion
component designed to comprehensively fuse information across 1 Introduction
different modalities and granularities. Finally, we conduct extensive
Understanding paths and developing effective path representations
experiments on two large-scale real-world datasets under two down-
are increasingly essential, offering invaluable insights for diverse
stream tasks, validating the effectiveness of the proposed MM-Path.
fields such as intelligent navigation [18, 19, 34, 35, 43], route rec-
This is an extended version of the paper accepted by KDD 2025. The
ommendation [7, 12, 49], urban planning [11, 17, 42], and urban
code is available at: https://github.com/decisionintelligence/MM-
emergency management [14]. Recent studies focus on developing
Path.
pre-trained path representation learning models, which have demon-
strated outstanding generalization capabilities [5, 21, 45]. These
CCS Concepts models efficiently produce generic path representations in an un-
• Computing methodologies → Machine learning. supervised manner. With simple fine-tuning and little labeled data,
they are adaptable to diverse downstream tasks such as travel time
estimation and path ranking score estimation. Consequently, they
* Corresponding authors significantly improve computational efficiency by reducing both
labeled data and runtime.
Paths have different modalities that provide richer, more diverse
information. For example, while paths derived from road networks
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed (road paths for short) elucidate topological relationships among road
for profit or commercial advantage and that copies bear this notice and the full citation segments in paths, remote sensing images of paths (image paths for
on the first page. Copyrights for components of this work owned by others than the short) provide insights into geometric features and broader environ-
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission mental contexts (see Figure 1). Integrating these modalities enriches
and/or a fee. Request permissions from [email protected]. path representations with varied perspectives, thereby improving
Conference acronym ’XX, June 03–05, 2018, Woodstock, NY accuracy and enhancing generalization capabilities. However, cur-
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/18/06 rent path representation learning models primarily rely on single-
https://doi.org/XXXXXXX.XXXXXXX modality data from road networks, which fails to capture the deep,
Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Xu et al.

comprehensive context essential for a complete understanding of intersections, road sub-paths, and entire road paths with their cor-
paths. This calls for developing a multi-modal pre-trained path rep- responding image information to capture details accurately at a
resentation learning model. Nonetheless, constructing such a model finer granularity as well as maintaining global correspondence at a
faces several challenges: coarser granularity. Specifically, we divide the image of the entire
Information granularity discrepancies between road paths interested region into small fixed-size images, collect the small fixed-
and image paths significantly hinder cross-modal semantic align- size images along each path, and arrange the collected images into
ment. Effective cross-modal alignment, which ensures semantic an image path (i.e., image sequence). We employ modal-specific
consistency and complementarity among various modalities, is cru- tokenizers to generate the initial embeddings for road paths and
cial for constructing multi-modal models [26]. However, the dis- image paths, respectively. Subsequently, these initial embeddings
crepancies in information granularity between road paths and image are fed into the powerful Transformer architecture to learn complex
paths are substantial. As depicted in Figure 1, road paths typically encoded embeddings for each modality at three granularities. Finally,
focus on detailed topological structures and delineate road connec- a multi-granularity alignment loss function is employed to ensure the
tivity, while the image paths capture global environmental contexts alignment of road and image encoded embeddings across different
on a large scale, reflecting the functional attributes of corresponding granularities.
regions. It is worth noting that images may include extensive regions To address the second challenge, we introduce a graph-based
that show low relevance to the road paths, such as the dark regions in cross-modal residual fusion component, which is designed to ef-
Figure 1 (c). Current image-text multi-modal models [2, 30, 37, 40] fectively fuse cross-modal features while incorporating spatial con-
typically align individual images with textual sequences. However, textual information. Specifically, we link the encoded embeddings
such single-granularity and coarse alignment methods introduce of each modality with the initial embeddings of the other modality
noise, which are not suitable for the precise alignment required for to create road and image residual embeddings, respectively, with
paths. Additionally, as shown in Figure 1 (a), roads have different the purpose of fusing cross-modal features from different stages.
granularities in nature, including intersections, road segments, and We then build a cross-modal adjacency matrix for each path based
sub-roads. Fully understanding paths at different granularities can on spatial correspondences and contextual information. This ma-
provide insights from micro to macro levels, mitigating the neg- trix guides the GCN to iteratively fuse the residual embeddings
ative effects caused by the differences in information granularity of each modality separately, thus obtaining road and image fused
across modalities. Although some studies [9, 32] have explored embeddings. Finally, we apply contrastive loss to ensure the con-
multi-granularity in single-modal data, they have not adequately sistency of the fused embeddings across the two modalities. As the
addressed the requirements for multi-granularity analysis in multi- final representation effectively integrates cross-stage features of the
modal contexts. Thus, it is crucial to refine multi-granularity data two modalities with spatial context information, this component
processing and develop multi-granularity methods for cross-modal not only achieves deep multi-modal fusion but also enhances the
alignment. comprehensive utilization of information.
The inherent heterogeneity of road paths and image paths The contributions of this work are delineated as follows:
poses a significant challenge during feature fusion. The differ- • We propose a Multi-modal, Multi-granularity Path Represen-
ences in data structure and information granularity between road tation Learning Framework that learns generic path represen-
paths and image paths extend to their learning methods. Road path tations applicable to various downstream tasks. To the best
representation learning typically focuses on connectivity and reacha- of our knowledge, MM-Path is the first model that leverages
bility between roads and intersections, as well as analyzing graph road network data and remote sensing images to learn generic
structures [3, 21, 44, 45]. Conversely, image learning methods that path representations.
are able to learn image paths prioritize object recognition and fea- • We model and align the multi-modal path information using
ture extraction, aiming for a broad understanding of image con- a fine-to-coarse multi-granularity alignment strategy. This
tent [1, 20]. These disparate learning methods lead to road paths strategy effectively captures both intricate local details and
and image paths mapped to different embedding spaces, resulting the broader global context of the path.
in feature dimensions with similar semantics containing entirely • We introduce a graph-based cross-modal residual fusion com-
different information. Simple fusion methods like early fusion (i.e., ponent. This component utilizes a cross-modal GCN to fully
integrating multiple modalities before or during the feature extrac- integrate information from different modalities while main-
tion stage) and late fusion (i.e., keeping each modality independently taining the consistency of dual modalities.
processed until the final fusion stage) may result in information loss • We conduct extensive experiments on diverse tasks using
and increased bias, and fail to capture subtle correlations between two real-world datasets to demonstrate the adaptability and
road paths and image paths [26, 47]. Therefore, a multi-modal fusion superiority of our model.
method that can capture the relationships among entities in different
modalities and ensuring effective data fusion, is critically needed. 2 Preliminaries
To address these challenges, we propose a Multi-modal, Multi-
granularity Path Representation Learning Framework, namely MM- 2.1 Basic Conception
Path, for learning generic path representations. Path. A path 𝑝 is a sequence of continuous junctions, which can be
To address the first challenge, we propose a multi-granularity observed from the road network view and the image view.
alignment component. This component systematically associates Road network. A road network is denoted as G = (V, E), where
V and E represent a set of nodes and edges, respectively. Node
MM-Path: Multi-modal, Multi-granularity Path Representation Learning—Extended Version Conference acronym ’XX, June 03–05, 2018, Woodstock, NY

𝑣 ∈ V is a road intersection or a road end. Edge 𝑒 ∈ E denotes a 3 Methodology


road segment connecting two nodes. Figure 3 illustrates the framework of MM-Path. This section intro-
Road paths. We define the sequence of nodes on a road network for duces the framework of MM-Path, describes its two main compo-
a path 𝑝 as a road path R (𝑝) = ⟨𝑣 1, 𝑣 2, ..., 𝑣 𝑁road ⟩, where each element nents, and details the final training objective.
represents a node, and 𝑁 road represents the length of the road path
R (𝑝). It is noted that there must be an edge 𝑒 ∈ E connecting any
3.1 Overall Framework
adjacent nodes in the road path.
Image paths. Given an interested region, we partition the region into Different from existing methods that are limited to data from a single
fixed-size segments to generate a set of images, M𝑟𝑠 , consisting of modality, MM-Path leverages data from both road networks and
𝑁 image disjoint, fixed-size remote sensing images. Each image within images for pre-training, providing a more comprehensive perspective.
this set is denoted as 𝑚 ∈ R𝑟 ×𝑟 ×𝑐 , where 𝑐 represents the number MM-Path comprises two main components: the multi-granularity
of channels, and (𝑟, 𝑟 ) denotes the resolution. Subsequently, given alignment component and the graph-based cross-modal residual
road path R (𝑝), the image path (i.e., image sequence of the path) fusion component.
M (𝑝) is formed by selecting a series of image 𝑚𝑖 that correspond The multi-granularity alignment component is designed to con-
to specific latitudes and longitudes along the nodes in the road path. centrate on path-related information while capturing fine-grained
For example, as shown in the upper part of Figure 2, consider the details and coarse-grained global context. Initially, we convert the
road path R (𝑝) = ⟨𝑣 1, ..., 𝑣 8 ⟩, where nodes 𝑣 1, 𝑣 2 , and 𝑣 3 are located image of interested region into fixed-sized image sequences to obtain
in image 𝑚 1 , nodes 𝑣 4, 𝑣 5 and 𝑣 6 in image 𝑚 2 , and nodes 𝑣 7 and 𝑣 8 image paths. Subsequently, We establish road path and image path
in image 𝑚 3 . This results in the image path M (𝑝) = ⟨𝑚 1, 𝑚 2, 𝑚 3 ⟩. encoding branches to process the two modalities. In each branch,
a modal-specific tokenizer generates initial embeddings for each
modality at three granularities: node/patch, road sub-path/image, and
𝑚1 𝑚2 𝑚3
road path/image path. These initial embeddings are then processed
by road and image transformers to produce road and image encoded
Image 𝑣5 𝑣7 𝑣8 embeddings, respectively, which are also generated at the same three
𝑣2 𝑣3 𝑣4 𝑣6
𝑚𝑖 ∈ ℝ𝑟×𝑟×𝑐 granularities. A multi-granularity loss function is utilized to syn-
𝑣1
chronize the semantic information of road encoded embeddings and
Patching Patching image encoded embeddings, and to capture their interrelations at
different granularities, from fine to coarse.
The graph-based cross-modal residual fusion component is de-
Patch
𝑚𝑖
(𝑗)
∈ ℝ𝑜×𝑜×𝑐
signed to effectively fuse cross-modal heterogeneous data. A cross-
(13)
𝑚1 modal residual connection merges the encoded embedding from each
Flattening Flattening branch with the initial embedding from the other branch, generat-
Tokens: [cls] … … [sep] … … [sep] … … [sep]
ing road and image residual embeddings. This connection considers
cross-modal features at different stages, promoting deep cross-modal
feature fusion. Subsequently, we construct a cross-modal adjacency
Figure 2: An example of image path processing
matrix for each path based on spatial correspondences and contex-
tual information. This matrix, embedded within a GCN, guides the
fusion of the two modalities for each branch. Consequently, a fused
Road Sub-paths. Given a road path R (𝑝) and an image path M (𝑝), embedding is obtained for each branch. We introduce a contrastive
the nodes of R (𝑝) located in the same image belong to a road sub- loss to ensure the consistency between the road fused embedding
path. and image fused embedding. Finally, we concatenate these two fused
Taken Figure 2 as an example, the road path R (𝑝) = ⟨𝑣 1, . . . , 𝑣 8 ⟩ embeddings to obtain a generic path representation.
has three road sub-paths: 𝑠 1 = ⟨𝑣 1, 𝑣 2, 𝑣 3 ⟩, 𝑠 2 = ⟨𝑣 4, 𝑣 5, 𝑣 6 ⟩ and 𝑠 3 =
⟨𝑣 7, 𝑣 8 ⟩.
3.2 Multi-granularity Alignment
We model the road paths and image paths using Transformer ar-
2.2 Problem Statement
chitecture, respectively. We then construct a multi-granularity loss
Given the road path R (𝑝) and image path M (𝑝) of a path 𝑝, the function to ensure alignment between these two modalities.
goal is to learn an embedding function 𝑓 that returns a generic
representation of path 𝑝. This function can be formalized as follows: 3.2.1 Input Representations. Due to information granularity
discrepancies between road paths and image paths, direct alignment
x = 𝑓 (R (𝑝), M (𝑝)), (1) is often disturbed by irrelevant information. To solve this problem,
we use a sequence of fixed-size images, instead of a single image
where x ∈ R𝑑 represents the generic embedding of path 𝑝, and 𝑑 commonly used in traditional image-text multi-modal methods [2,
denotes the dimension of the embedding x. 30, 37, 40], to model image paths. This procedure preserves the
These learned path embeddings are supposed to be generic, which scale and shape features of the images by avoiding distortions caused
should support a variety of downstream tasks, e.g., path travel time by inconsistencies in image sizes. Furthermore, as these fixed-size
estimation and path ranking score estimation. images can be utilized across different paths, the storage of images is
Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Xu et al.

Multi-granularity Alignment Graph-based Cross-Modal


Residual Fusion
Image Path Encoding Branch

Image encoded embeddings

Image residual embeddings


Image initial embeddings
cls cls
Image fused

Image-Transformer
Multi-granularity embedding
𝑚1 𝑚2 𝑚3

AVGPooling
Tokenizer
sep sep Loss

Patch
Image


Patch Image
sep sep path
Image path Multi-layer
Node-unrelated token Image token GCN


sep sep
Node-related token Image path token
sep cls Contrastive
loss
Road Path Encoding Branch

Road encoded embeddings

Road residual embeddings


cls cls
Road initial embeddings

Road-Transformer
Random Masking
Node Tokenizer

𝑣3


AVGPooling
𝑣7 𝑣8 sep sep sep cls

𝑣2 𝑣3 𝑣4 𝑣5 𝑣6
𝑣𝑖


𝑣1 sep sep

𝑣7 Node Road Road


Road Path sub-path path
Multi-layer
Road fused


GCN

Node token
sep sep embedding
Road sub-path token Road path token

Figure 3: Overall framework of MM-Path

reduced. Then, we utilize specialized tokenizers to separately encode the road path R (𝑝) to a node embedding v𝑖 ∈ R𝑑 , initialized using
the data of each modality into a unified format. Node2vec [16]. We also integrate standard learnable road position
The patch tokenizer segments each image within an image path embeddings Troad ∈ R𝑛2 ×𝑑 with the node embeddings to generate
into a series of patches to extract fine-grained semantic informa- the road initial embeddings P (0) = [vcls, v1, ..., v | R (𝑝 ) | , vsep ] +Troad ,
tion. Specifically, as shown in Figure 2, an image 𝑚𝑖 ∈ R𝑟 ×𝑟 ×𝑐 where vcls and vsep represent the road initial embeddings of the [cls]
is reshaped into a sequence of 𝑟 2 /𝑜 2 (e.g., 16) patches, where 𝑐 and [sep] tokens, respectively. 𝑛 2 denotes the length of the road path
represents the number of channels, (𝑟, 𝑟 ) denotes the resolution token sequence.
of fixed-size image, and (𝑜, 𝑜) defines the resolution per patch.
3.2.2 Image Path Encoding. To model the image path, the image
After patching, we concatenate the patch sequences from all im-
ages within a image path to form a unified patch sequence. Then, initial embeddings H (0) are fed into the 𝑙 layers Image-Transformer,
we place a special [cls] token at the beginning of the patch se- which can be formulated as,
quence. As the [cls] token captures the global information of the H ( 𝑗 ) = Image-Transformer(H ( 𝑗 −1) ), (2)
entire sequence [1], it can be regarded as a representation of the
where 𝑗 = 1, ..., 𝑙, and H (𝑙 ) ∈ R𝑛1 ×𝑑 is the final output of the Image-
entire image path. Special [sep] tokens are placed at the end of
Transformer. For a simplify, we denote the image encoded embed-
each image to delineate local information of each image. For ex- (1) (𝑟 2 /𝑜 2 ) (𝑗)
ample, the token sequence of image path M (𝑝) = ⟨𝑚 1, 𝑚 2, 𝑚 3 ⟩ is dings H (𝑙 ) as H = [hcls, h1 , ..., h | M (𝑝 ) | , hsep|M (𝑝 ) | ]. h𝑖 ∈ R𝑑
(1) (16) (1) (16) (𝑗) denotes the encoded embedding of the 𝑗-th patch of the 𝑖-th image.
[cls, 𝑚 1 , . . . , 𝑚 1 , sep, . . . , sep, m3 , . . . , 𝑚 3 , sep], where 𝑚𝑖 ∈
R𝑜 ×𝑜 ×𝑐 denotes the 𝑗-th patch of the 𝑖-th image. hsep𝑖 , hcls ∈ R𝑑 represent the encoded embeddings of the 𝑖-th image
(𝑗) (𝑗) and the whole image path M (𝑝), respectively.
Each patch 𝑚𝑖 is then projected into a patch embedding m𝑖 ∈
𝑑
R , which can be initialized using pre-trained ResNet50 [20]. The 3.2.3 Road Path Encoding. To facilitate alignment between road
image initial embeddings are computed by summing the patch em- paths and image paths, we utilize a similar Transformer architecture
beddings with the image position embeddings Timage ∈ R𝑛1 ×𝑑 , for road path modeling. The encoder is comprised of 𝑙 layers of
(1) (𝑟 2 /𝑜 2 ) Transformer blocks, defined as:
resulting in H (0) = [mcls, m1 , ..., m | M (𝑝 ) | , msep ] + Timage . Here,
mcls and msep are the image initial embeddings of the [cls] and P ( 𝑗 ) = Road-Transformer(P ( 𝑗 −1) ), (3)
[sep] tokens, respectively. 𝑛 1 denotes the length of the patch token where 𝑗 = 1, ..., 𝑙, P( 𝑗 ) R𝑛2 ×𝑑
∈ is the output of the 𝑗-th layer. In
sequence, and 𝑑 represents the dimension of the embeddings. a brief, the road encoded embeddings P (𝑙 ) are represented as P =
The modeling for a road path is similar, starting with a [cls] to- [pcls, p1, . . . , p | R (𝑝 ) | , psep|M (𝑝 ) | ]. Here, p𝑖 , psep𝑖 , pcls ∈ R𝑑 denote
ken at the beginning and placing [sep] tokens at the end of each the encoded embeddings of the 𝑖-th node, the sub-path 𝑠𝑖 , and the
road sub-path. For instance, a road path R (𝑝) comprising three entire road path R (𝑝), respectively. |R (𝑝)| represents the number
road sub-paths—𝑠 1 = ⟨𝑣 1, 𝑣 2, 𝑣 3 ⟩, 𝑠 2 = ⟨𝑣 4, 𝑣 5, 𝑣 6 ⟩, and 𝑠 3 = ⟨𝑣 7, 𝑣 8 ⟩— of nodes in road path R (𝑝), and |M (𝑝)| indicates the number of
generates the node token sequence [cls, 𝑣 1, 𝑣 2, 𝑣 3, sep, 𝑣 4, 𝑣 5, 𝑣 6, sep, 𝑣 7, images in image path M (𝑝), which also corresponds to the number
𝑣 8, sep]. Then, the node tokenizer linearly projects each node 𝑣𝑖 from of road sub-paths.
MM-Path: Multi-modal, Multi-granularity Path Representation Learning—Extended Version Conference acronym ’XX, June 03–05, 2018, Woodstock, NY

To better capture the complex dependencies in a path, similar for coarse-grained data. Considering a batch of road path-image
to masked language modeling task (MLM) [22], we use a masked path pairs B, the objective of this contrastive learning loss is to
node modeling task as a self-supervised task. The intuition behind accurately identify the matched pairs among the |B| × |B| possible
this is that the information density of individual pixels or patches combinations. Within a training batch, there are |B| 2 − |B| negative
in an image is relatively low compared to the topological structure pairs. The contrastive learning loss function can be formulated as:
information relevant to the path. Therefore, image masking tasks ∑︁ exp(sim(pcls, hcls )/𝜎)
are not deemed essential. To this end, we propose to employ node Lcoarse = − (log( Í Neg
masking tasks. In particular, we randomly mask the nodes in road 𝑝∈P 𝑚 Neg ∈ B exp(sim(pcls , hcls )/𝜎)
(7)
paths (Ref. as to the gray triangle in Figure 3) , and then use a exp(sim(pcls, hcls )/𝜎)
+Í Neg
),
softmax classifier to predict the node tokens corresponding to the
𝑝 Neg ∈ B exp(sim(pcls , hcls )/𝜎)
masked nodes. The loss function for training is defined as follows:
∑︁ ∑︁ where pcls and hcls (Ref. as to the dark yellow rectangle and the
Lmask = − log P(𝑣𝑖 |𝑣𝑖mask ), dark green triangle with dark blue borders in Figure 3) correspond
(4)
𝑝∈P 𝑖∈D to the encoded embeddings of the entire road path and image path in
where P represents the training sets of all paths, D is randomly a road path-image path pairs. 𝑝 Neg and 𝑚 Neg are the negative road
masked positions of road path, and 𝑣𝑖mask is the node that is masked path and image path in the batch set B, respectively. 𝜎 is a learned
according to D. temperature parameter. sim(pcls, hcls ) returns the Euclidean distance
between pcls and hcls .
3.2.4 Modalities Aligning. The encoded embeddings from each
Finally, the multi-granularity loss can be formulated as Lmulti =
branch capture the hidden semantic information within their re-
Lfine + Lmedium + Lcoarse .
spective modality, including fine-grained node/patch embeddings,
medium-grained road sub-path/image embeddings, and coarse-grained 3.3 Graph-based Cross-modal Residual Fusion
entire road path/image path embeddings. We aim for embeddings
with similar semantics across modalities to be proximate within the In this framework, each path is represented through two distinct
embedding space. Additionally, we seek detailed alignment between modalities, providing complementary perspectives. To effectively
the two modalities while maintaining global correspondence. Ac- leverage these modalities, we propose a graph-based cross-modal
cordingly, we design a loss function that operates at three distinct residual fusion component.
levels of granularity—fine, medium and coarse—corresponding to 3.3.1 Cross-modal Residual Connection. To facilitate com-
node/patch, road sub-path/image, and entire road path/image path, prehensive information exchange between modalities, we introduce
respectively. cross-modal residual connections that effectively concatenate em-
Fine granularity. Since each patch may contain more than one beddings across different stages and modalities. These connections
node, the encoded embeddings of a node and the corresponding patch enable direct propagation of gradients to earlier layers, thereby en-
(Ref. as to the dark yellow triangle the dark green rectangle with hancing stability and improving training efficiency. Specifically, we
yellow borders in Figure 3) should maintain directional consistency. concatenate the road initial embeddings P (0) with the image en-
To precisely capture the semantic information of fine-grained paths, coded embeddings H, and the image initial embeddings H (0) with
we minimize the cosine distance between the encoded embeddings of the road encoded embeddings P. The resulting image residual embed-
nodes and their corresponding patches. Consequently, we construct dings and road residual embeddings are defined as U = P (0) ∥H and
the fine-grained loss function as follows: Q = P∥H (0) , respectively. Here, U, Q ∈ R (𝑛1 +𝑛2 ) ×𝑑 , and ∥ denotes
(𝑘 ) the concatenation operation.
∑︁ ∑︁ p𝑖 · h 𝑗
Lfine = (1 − ), (5)
(𝑘 )
∥p𝑖 ∥∥h 𝑗 ∥ 3.3.2 Graph-based Fusion. Although traditional attention mech-
𝑝 ∈ P 𝑣 ∈ R (𝑝 ),L(𝑣 )=𝑚 (𝑘 )
𝑖 𝑖 𝑗 anisms proficiently identify correlations among entities, they often
where P represents the training set of paths, L(𝑣𝑖 ) is a function that fail to incorporate contextual information concurrently. To address
(𝑘 )
returns the patch corresponding to the node 𝑣𝑖 , p𝑖 and h 𝑗 are the this limitation, we utilize graph neural networks [4], which incorpo-
(𝑘 )
rate contextual information into the learning process by representing
encoded embeddings of 𝑣𝑖 and 𝑚 𝑗 , respectively. it as graph structures. Leveraging this capability, we introduce a
Medium granularity. Similarly, to align road sub-paths with graph-based fusion method to enhance the accuracy of information
images, we construct the following medium-grained loss function: understanding across different modalities.
∑︁ ∑︁ psep𝑖 · hsep𝑖 Initially, we construct a specialized cross-modal directed graph
Lmedium = (1 − ), (6) for each path. This graph treats all tokens, including [cls] and [sep]
∥psep𝑖 ∥∥hsep𝑖 ∥
𝑝 ∈ P 𝑠𝑖 ∈ R (𝑝 ) tokens from both modalities, as entities. These entities are connected
where psep𝑖 and hsep𝑖 (Ref. as to the dark yellow rectangle and the via three types of relationships: intra-modal context, cross-modal
dark green triangle with blue borders in Figure 3) are the encoded correspondence, and cross-modal context. The intra-modal context
embeddings of road sub-path 𝑠𝑖 and the corresponding image 𝑚𝑖 , focuses on interactions within a single modality, facilitating a deep
respectively. understanding of its specific information. Cross-modal correspon-
Coarse granularity. Due to the unique correspondence between dence aids in comprehending and learning the spatial correspondence
road path and the corresponding image path, a clearer distinction between different modalities. Cross-modal context addresses indi-
is necessary. Therefore, We construct a contrastive loss function rect relationships between different modal entities, which enhances
Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Xu et al.

L(𝑣3 ) L(𝑣4 ) L(𝑣5 ) L(𝑣3 ) L(𝑣4 ) L(𝑣5 )


𝑟
(𝑗− )
𝑜
3.3.3 Cross-modal Constrastive Loss. The image fused embed-
𝑚𝑖
𝑚𝑖
(𝑗+1) ding y and the road fused embedding z encapsulate features across
L(𝑣4 )
multiple modalities of the same path, reflecting an inherent similar-
(𝑗−1)
𝑚𝑖
ity. Therefore, we implement a quadruplet loss function to ensure
𝑟
𝑣3 𝑣4 𝑣5 𝑣3 𝑣4 𝑣5 (𝑗+𝑜)
𝑚𝑖
that the difference between y and z is smaller than the differences
with the fused embeddings of other paths. For negative samples, we
𝑟 𝑟
Entities … 𝑣3 𝑣4 𝑣5 … L(𝑣3 ) … 𝑚(𝑗−𝑜) … 𝑚𝑖(𝑗−1) L(𝑣4 ) 𝑚𝑖(𝑗+1) … 𝑚(𝑗+𝑜) … L(𝑣5 ) …
𝑖 𝑖
randomly sample the image fused embedding yN and the road fused
𝑣4 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 embedding zN from the batch. The loss function is defined as:
L(𝑣4 ) 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 ∑︁
Lfuse = − ([||y − z|| 22 − ||y − zN || 22 + 𝛽] +
𝑝∈P (12)
Figure 4: An example of multi-modal graph construction
+ [||y − z|| 22 − ||z − yN || 22 + 𝛽] + ),

the model’s ability to interpret complex scenes. Collectively, lever- where 𝛽 is a hyperparameter that controls the margin of the dis-
aging these relationships significantly boosts the model’s capacity tance between pairs of positive and negative samples, and [·] + is a
to handle multi-modal data effectively. shorthand for max(0, ·).
Figure 4 demonstrates the construction of the graph. Taking node
𝑣 4 as an example, node 𝑣 4 is connected by directed edges from five 3.4 Training objective
entities: adjacent context nodes 𝑣 3 and 𝑣 5 , its corresponding patch The final training objective of our model integrates all previously
L(𝑣 4 ), and their respective patches L(𝑣 3 ) and L(𝑣 5 ). The patch L(𝑣 4 ) proposed loss functions, formulated as follows:
(𝑗)
(i.e., 𝑚𝑖 ) is connected by directed edges from nine entities, includ- L = 𝜆mask Lmask + 𝜆multi Lmulti + 𝜆fuse Lfuse, (13)
ing 𝑣 3 , 𝑣 4 , 𝑣 5 , L(𝑣 3 ), L(𝑣 5 ), and four geographically adjacent image
( 𝑗 −1) ( 𝑗+1) (𝑗− 𝑟 ) ( 𝑗+ 𝑟 ) where 𝜆mask , 𝜆multi and 𝜆fuse are the weights assigned to Lmask ,
patches 𝑚𝑖 , 𝑚𝑖 , 𝑚𝑖 𝑜 and 𝑚𝑖 𝑜 , where 𝑜𝑟 denotes the
Lmulti , and Lfuse .
number of patches per row of an image. Additionally, the [sep] to-
After pre-training, we combine the image fused embedding y with
kens are connected by their context [sep] tokens, corresponding [sep]
the road fused embedding z into a generic path embedding x = y||z,
tokens from another modality, and cross-modal context tokens. The
achieving a more robust and generalized representation. The generic
[cls] token, encapsulating more global information, is connected by
path embedding is then fine-tuned using interchangeable linear layer
all [sep] tokens and corresponding [cls] token from another modality.
task heads, enabling the model to adapt to a variety of downstream
Then, we construct an adjacency matrix A ∈ R (𝑛1 +𝑛2 ) × (𝑛1 +𝑛2 )
tasks effectively.
for each path 𝑝 to capture the comprehensive relation within the
multi-modal data. Given the effectiveness of GCNs in transferring
4 Experiments
and fusing information across entities within a graph structure, we
employ a GCN to derive the updated embeddings for both branches. 4.1 Experimental Setups
The updated embeddings are computed as follows: 4.1.1 Datasets. We utilize the road networks, GPS datasets, and
remote sensing image datasets of two cities: Aalborg, Denmark, and
 1  1  
− −1 − −1
Û = Relu D̃ 2 ÃD̃ 2 Relu D̃ 2 ÃD̃ 2 UW1 W2 , (8) Xi’an, China. The road networks are sourced from OpenStreetMap1 ,
 1  1   while the remote sensing image datasets are acquired from Google
− −1 − −1
Q̂ = Relu D̃ 2 ÃD̃ 2 Relu D̃ 2 ÃD̃ 2 QW3 W4 , (9) Earth Engine[15]. Employing an existing tool [31], we map-match
all GPS records to road networks to generate the path datasets and
where W1, W2, W3, W4 ∈ R𝑑 ×𝑑 are weight matrices, and 𝐷˜ is the historical trajectory datasets. The details of the datasets are shown
degree matrix of Ã. The augmented adjacency matrix à = A + I′ , in Table 1.
where I′ is a modified identity matrix with all diagonal elements set
4.1.2 Implementation Details. All experiments are conducted
to 1, except for those corresponding to patches without relationship
using PyTorch [33] on Python 3.8 and executed on an NVIDIA
to any nodes (Ref. as to the dark green rectangle with a white border
Tesla-A800 GPU. Each fixed-size image is 500 × 500 pixels, with
in Figure 3). This modification aims to exclude patches that are
each pixel corresponding to 2 meters on the earth. In other words, an
relatively unrelated to the path, thereby preventing the introduction
image covers a 1km × 1km region. We segment each image into 16
of noise into the model.
patches and set the embedding dimension 𝑑 to 64. Both the Image-
After iterative graph convolution operations, the embeddings of
Transformer and Road-Transformer comprise five layers. To enhance
each entity within the graph are updated. We perform average pool-
the pre-training efficiency, we initialize our Road-Transformer with
ing on Û and Q̂ to aggregate the updated embeddings, respectively.
the pre-trained LightPath [45]. The mask ratio is set at 15%. The
The fused embedding for each branch is then obtained by:
weights 𝜆mask , 𝜆fuse , and 𝜆multi are uniformly set to 1. Training
y = AvgPooling( Û), (10) proceeds for up to 60 epochs with a learning rate of 0.02. The linear
layer task head includes two fully connected layers, with dimensions
z = AvgPooling( Q̂), (11)
of 32 and 1, respectively. Training MM-Path on the Aalborg and
where y, z ∈ R𝑑
denote the image fused embedding and the road
fused embedding, respectively. 1 https://www.openstreetmap.org
MM-Path: Multi-modal, Multi-granularity Path Representation Learning—Extended Version Conference acronym ’XX, June 03–05, 2018, Woodstock, NY

Table 1: Data statistics • USPM [8] utilizes both images and road network to profile
individual streets, but not paths (i.e., sequences of streets).
Aalborg Xi’an We adapt USPM to support path representation learning.
Number of nodes 7,561 7,051 • JGRM [29]: This is a representation learning framework
Number of edges 9,605 9,642 that combines GPS data and road network-based data. In this
AVG edge length (m) 124.78 86.49 study, we replace the GPS data with the image path.
• LightPath+image: This is a multi-modal variant of Light-
Number of path 47,865 200,000
Path. We concatenate the patch embedding with the node
AVG node number per road path 25.77 55.39
embedding to replace the original node embedding for train-
AVG path length (m) 3,252.75 4,743.70 ing.
Number of traj. 149,246 797,882 • START+image: This is a multi-modal variant of START,
Max travel time of traj. (s) 3,549 8,638 processed similarly to LightPath+image.
Avg travel time of traj. (s) 199 662
For all methods, we standardize the embedding dimensionality
Number of images 950 133 (𝑑) to 50. All parameters are set according to the specifications in the
AVG number of nodes per image 7.96 53.01 original papers. All baselines are fine-tuned using a linear layer task
AVG image number per image path 6.28 6.87 head. The output of this task head serves as the prediction result.
For all methods, we initially pre-train using unlabeled training
data (e.g., 30K unlabeled Aalborg dataset and 160K unlabeled Xi’an
dataset). Subsequently, we use a smaller volume of labeled data
Xi’an datasets takes 78 and 161 minutes, respectively. Since training (e.g., 10K labeled Aalborg dataset and 40K labeled Xi’an dataset)
is conducted offline, the runtime is acceptable. for task-specific fine-tuning. Validation and evaluation processes are
4.1.3 Downstream Tasks and Metrics. Path Travel Time Esti- conducted on separate validation dataset (e.g., 5K Aalborg dataset
mation: We calculate the average travel time (in seconds) for each and 20K Xi’an dataset) and test dataset (e.g., 10K Aalborg dataset
path based on historical trajectories. The accuracy of travel time and 40K Xi’an dataset), respectively.
estimations is evaluated using three metrics: Mean Absolute Error
(MAE), Mean Absolute Relative Error (MARE), and Mean Absolute 4.2 Experimental Results
Percentage Error (MAPE). Path Ranking Score Estimation (Path 4.2.1 Overall Performance. Table 2 presents the overall perfor-
Ranking): Each path is assigned a ranking score ranging from 0 mance on both tasks. We use ‘↑’ (and ‘↓’) to indicate that larger (and
to 1, derived from historical trajectories by following existing stud- smaller) values are better. For each task, we highlight the best and
ies [44–46]. We evaluate the effectiveness of path ranking using second-best performance in bold and underline. “Improvement” and
MAE, the Kendall rank correlation coefficient (𝜏), and Spearman’s “Improvement*” quantify the enhancements achieved by MM-Path
rank correlation coefficient (𝜌). over the best single-modal and multi-modal baselines, respectively.
Overall, MM-Path outperforms all baselines on these tasks across
4.1.4 Baselines. We compare the proposed model with 5 unsu- both datasets, demonstrating its superiority. Specifically, we can
pervised single-modal path pre-trained methods and 5 unsupervised make the following observations: The graph representation learning
multi-modal methods. The single-modal path pre-trained methods method Node2vec significantly underperforms compared to MM-
are: Path, primarily due to its focus solely on the topological information
• Node2vec [16]: This is an unsupervised model that learn node of nodes while overlooking the sequential information of paths.
representation based on a graph. Single-modal models like PIM, LightPath, and TracjCL show im-
• PIM [3]: This is an unsupervised path representation learning proved performance over Node2vec, indicating the importance of
approach based on mutual information maximization. capturing sequential correlations within paths. Among the single-
• LightPath [45]: This is a lightweight and scalable path repre- modal models, START achieves the best performance. It adeptly
sentation learning method. integrate sequential path information with spatio-temporal transition
• TrajCL [5]: This is a contrastive learning-based trajectory relationships derived from historical trajectory data. However, as
modeling method. a single-modal model, its capabilities are inherently constrained.
• START [21]: This is a self-supervised trajectory representa- As a multi-modal model, CLIP exhibits the weakest performance.
tion learning framework with temporal regularities and travel Designed primarily for general corpora, it focuses on single, coarse-
semantics. grained image representations, which often introduce noise into path
modeling. Consequently, CLIP struggles to effectively capture com-
The multi-modal methods are:
plex spatial information and correspondences, making it unsuitable
• CLIP [37]: This is a classic pre-trained multi-modal model. for modeling paths. USPM performs poorly because it analyzes in-
For each path, we use a single rectangular image for the im- dividual streets using images and road networks, rather than paths
age modality and replace the original text sequence with a (i.e., street sequences). As a result, it fails to effectively mine the
node sequence. After pre-training, we concatenate the repre- sequential relationships present in the two modalities. The variants
sentations of the two modalities and use them as input to the LightPath+image and START+image perform comparably to their
linear layer task head. single-modal models (i.e., LightPath and START), suggesting that
Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Xu et al.

Table 2: Overall accuracy on travel time estimation and path ranking

Aalborg Xi’an
Methods Travel Time Estimation Path Ranking Travel Time Estimation Path Ranking
MAE ↓ MARE ↓ MAPE ↓ MAE ↓ 𝜏↑ 𝜌↑ MAE ↓ MARE ↓ MAPE ↓ MAE ↓ 𝜏↑ 𝜌↑
Node2vec [16] 76.228 0.281 54.182 0.203 0.119 0.140 227.129 0.269 30.919 0.218 0.079 0.098
PIM [3] 63.812 0.237 47.054 0.144 0.284 0.343 207.266 0.246 27.716 0.207 0.091 0.102
Lightpath [45] 58.818 0.221 40.219 0.124 0.413 0.483 201.400 0.229 26.429 0.178 0.209 0.252
TrajCL [5] 53.822 0.208 34.239 0.113 0.499 0.577 202.757 0.238 26.506 0.181 0.211 0.256
START [21] 51.176 0.191 34.315 0.117 0.475 0.556 199.843 0.215 25.022 0.179 0.229 0.279
CLIP [37] 72.155 0.261 50.284 0.162 0.179 0.185 219.048 0.256 30.962 0.213 0.087 0.099
USPM [8] 66.714 0.249 51.916 0.148 0.308 0.383 205.594 0.244 26.039 0.209 0.105 0.110
JGRM [29] 51.251 0.193 32.380 0.115 0.512 0.592 201.010 0.228 26.400 0.177 0.228 0.262
Lightpath+image 59.698 0.224 40.920 0.131 0.383 0.405 205.556 0.242 27.058 0.182 0.188 0.231
START+image 51.859 0.188 33.401 0.122 0.437 0.521 200.059 0.211 26.046 0.184 0.183 0.226
MM-Path 47.756 0.172 29.808 0.106 0.558 0.643 187.452 0.193 23.644 0.165 0.257 0.294
Improvement 6.682% 9.947% 12.941% 6.194% 11.823% 11.443% 6.201% 10.236% 5.507% 7.303% 12.227% 5.376%
Improvement* 6.819% 8.511% 7.943% 7.826% 8.984% 8.614% 6.312% 8.531% 9.222% 6.780% 12.719% 12.213%

merely concatenating two modalities does not effectively enhance significantly enhances the model’s effectiveness. This conclusively
multi-modal fusion. Having adapted JGRM to integrate image paths validates that MM-Path optimally utilizes all designed components.
and road paths, JGRM outperforms other multi-modal baselines. It is
specifically designed for multi-modal integration and excels at merg- Table 3: Effect of variants of MM-Path in Aalborg
ing information from various sources. However, JGRM’s limitations Aalborg
in handling multi-modal information of varying granularities and its Methods Travel Time Estimation Path Ranking
lack of use of cross-modal context information to guide the fusion MAE MARE MAPE MAE 𝜏 𝜌
process make its performance less optimal compared to MM-Path. MM-Path-z 49.649 0.185 30.193 0.114 0.528 0.622
MM-Path-y 48.529 0.181 32.722 0.118 0.511 0.603
w/o alignment 52.832 0.201 36.251 0.131 0.300 0.379
4.2.2 Ablation Study. We design eight variants of MM-Path to
w/o fusion 51.237 0.192 30.529 0.115 0.476 0.560
verify the necessity of the components of our model: (1) MM-Path-
w/o GCN 48.651 0.183 33.371 0.111 0.532 0.619
z: This variant leverages the road fused embedding z as a generic
w/o fine 51.641 0.192 33.277 0.129 0.441 0.523
representation of the path. (2) MM-Path-y: This model utilizes the w/o medium 50.932 0.187 34.250 0.114 0.494 0.583
image fused embedding y as a generic representation of the path. w/o coarse 50.688 0.189 35.341 0.117 0.505 0.596
(3) w/o alignment: This version excludes the multi-granularity loss. MM-Path 47.756 0.172 29.808 0.106 0.558 0.643
(4) w/o fusion: This variant substitutes the graph-based residual
fusion component with average pooling of the encoded embeddings
from both modalities. (5) w/o GCN: This model replaces the GCN Table 4: Effect of variants of MM-Path in Xi’an
in the graph-based cross-modal residual fusion component with a
cross-attention mechanism. (6) w/o fine, (7) w/o medium, and (8) Xi’an
w/o coarse: These variants omit the fine-grained, medium-grained, Methods Travel Time Estimation Path Ranking
MAE MARE MAPE MAE 𝜏 𝜌
and coarse-grained loss, respectively.
MM-Path-z 194.301 0.231 24.455 0.183 0.199 0.241
The results are summarized in Tables 3 and 4. We can observe
MM-Path-y 196.331 0.233 24.747 0.196 0.178 0.214
that MM-Path w/o alignment shows poor performance, which is
w/o alignment 200.335 0.239 26.459 0.195 0.131 0.167
attributed to its reliance solely on multi-modal data fusion without
w/o fusion 200.652 0.239 25.433 0.208 0.113 0.130
considering multi-granularity alignment. The variants, w/o fine, w/o
w/o GCN 189.659 0.221 24.496 0.173 0.234 0.286
medium, and w/o coarse, outperform w/o alignment but still worse w/o fine 199.214 0.235 26.913 0.177 0.226 0.275
than the full MM-Path, demonstrating the importance of multiple w/o medium 192.514 0.229 24.826 0.175 0.227 0.278
granularity alignments. MM-Path w/o fusion also exhibits poor per- w/o coarse 194.256 0.230 25.757 0.176 0.231 0.278
formance, while MM-Path w/o GCN performs slightly worse than MM-Path 187.452 0.193 23.644 0.165 0.257 0.294
MM-Path. These results indicate that complex fusion methods with
cross-modal context information enhance path understanding. Both
MM-Path-y and MM-Path-z demonstrate comparable performance 4.2.3 Effect of Pre-training. In this section, we evaluate the ef-
in travel time estimation and path ranking. This indicates that dif- fect of pre-training. We vary the size of labeled data used for fine-
ferent modal perspectives contribute valuable insights for various tuning, and compare the performance of the proposed MM-Path
downstream tasks. The overall performance of MM-Path surpasses (Pre-trained) with its variant that lacks pre-training (No Pre-trained).
all variants. This result implies that each of the proposed components Figure 5 shows the performance of travel time estimation and path
MM-Path: Multi-modal, Multi-granularity Path Representation Learning—Extended Version Conference acronym ’XX, June 03–05, 2018, Woodstock, NY

 
 3UHWUDLQHG 3UHWUDLQHG 3UHWUDLQHG

1R3UHWUDLQHG 1R3UHWUDLQHG 1R3UHWUDLQHG
 

0$(



0$(
  
3UHWUDLQHG 
1R3UHWUDLQHG
   
                   
7UDLQLQJ'DWD6L]H . 7UDLQLQJ'DWD6L]H . 7UDLQLQJ'DWD6L]H . 7UDLQLQJ'DWD6L]H .
(a) Travel Time Estimation, Aalborg (b) Path Ranking, Aalborg (c) Travel Time Estimation, Xi’an (d) Path Ranking, Xi’an

Figure 5: Effect of pre-training


       
5XQWLPH 0$( 5XQWLPH 5XQWLPH
7HVWLQJ PVSDWK

7HVWLQJ PVSDWK

7HVWLQJ PVSDWK
5XQWLPH 0$(

7HVWLQJ PVSDWK
    
  
0$(

0$(
    
  
    
 î î î î î   î î î î î   î î î î î   î î î î î 
1XPEHURI3DWFKHV 1XPEHURI3DWFKHV 1XPEHURI3DWFKHV 1XPEHURI3DWFKHV
(a) Travel Time Estimation, Aalborg (b) Path Ranking, Aalborg (c) Travel Time Estimation, Xi’an (d) Path Ranking, Xi’an

Figure 6: Effect of varing granularity size in Aalborg

Table 5: Accuracy on travel time estimation and path ranking for short paths
Aalborg Xi’an
Methods Travel Time Estimation Path Ranking Travel Time Estimation Path Ranking
MAE ↓ MARE ↓ MAPE ↓ MAE ↓ 𝜏↑ 𝜌↑ MAE ↓ MARE ↓ MAPE ↓ MAE ↓ 𝜏↑ 𝜌↑
Node2vec [16] 67.898 0.306 55.179 0.201 0.140 0.153 179.828 0.269 32.512 0.220 0.081 0.084
PIM [3] 55.871 0.252 44.605 0.150 0.292 0.348 150.356 0.238 26.221 0.206 0.174 0.198
Lightpath [45] 52.708 0.237 40.414 0.127 0.445 0.502 149.816 0.237 26.702 0.180 0.201 0.249
TrajCL [5] 49.427 0.223 34.466 0.116 0.523 0.597 150.657 0.239 26.891 0.183 0.219 0.260
START [21] 47.347 0.216 36.523 0.122 0.513 0.586 147.745 0.227 25.973 0.179 0.219 0.277
CLIP [37] 62.300 0.281 49.164 0.166 0.176 0.217 170.818 0.271 33.201 0.215 0.079 0.095
USPM [8] 57.415 0.274 48.949 0.152 0.314 0.360 149.012 0.234 26.942 0.207 0.133 0.151
JGRM [29] 47.770 0.213 33.715 0.117 0.516 0.612 148.832 0.236 26.341 0.176 0.221 0.262
Lightpath+image 54.819 0.247 42.258 0.134 0.383 0.434 150.657 0.239 26.891 0.184 0.186 0.212
START+image 49.427 0.223 34.462 0.125 0.476 0.548 148.684 0.236 26.503 0.185 0.181 0.231
MM-Path 44.092 0.202 31.058 0.110 0.562 0.662 138.543 0.218 25.112 0.164 0.253 0.294
Improvement 6.874% 6.481% 9.809% 5.172% 7.456% 10.887% 6.228% 3.964% 3.315% 8.379% 15.525% 5.755%
Improvement* 7.699% 5.164% 7.881% 5.982% 8.914% 8.169% 6.821% 7.234% 4.665% 6.818% 14.479% 12.213%

Table 6: Accuracy on travel time estimation and path ranking for long paths
Aalborg Xi’an
Methods Travel Time Estimation Path Ranking Travel Time Estimation Path Ranking
MAE ↓ MARE ↓ MAPE ↓ MAE ↓ 𝜏↑ 𝜌↑ MAE ↓ MARE ↓ MAPE ↓ MAE ↓ 𝜏↑ 𝜌↑
Node2vec [16] 193.021 0.295 25.416 0.231 0.019 -0.008 262.160 0.279 28.294 0.171 0.015 0.108
PIM [3] 135.348 0.207 17.855 0.098 0.210 0.239 246.606 0.246 24.981 0.186 0.179 0.207
Lightpath [45] 114.599 0.175 16.079 0.080 0.371 0.326 241.666 0.237 24.752 0.107 0.259 0.317
TrajCL [5] 99.041 0.151 14.215 0.077 0.346 0.372 243.581 0.238 24.951 0.111 0.266 0.328
START [21] 87.890 0.136 13.220 0.078 0.328 0.364 237.245 0.230 23.973 0.108 0.244 0.301
CLIP [37] 174.421 0.266 23.065 0.126 0.145 0.172 255.160 0.249 28.752 0.208 0.093 0.104
USPM [8] 147.236 0.232 24.435 0.105 0.201 0.218 243.380 0.240 24.549 0.157 0.216 0.277
JGRM [29] 89.125 0.138 13.131 0.076 0.381 0.379 239.386 0.234 24.379 0.109 0.265 0.319
Lightpath+image 111.703 0.171 15.897 0.092 0.326 0.308 243.581 0.238 24.951 0.112 0.236 0.323
START+image 99.041 0.151 14.215 0.083 0.317 0.323 237.347 0.232 24.641 0.113 0.258 0.312
MM-Path 83.125 0.121 11.214 0.067 0.421 0.424 224.227 0.190 21.832 0.103 0.279 0.345
Improvement 5.421% 11.029% 15.174% 12.987% 13.477% 13.978% 3.833% 17.391% 8.931% 3.738% 2.307% 5.182%
Improvement* 6.732% 12.318% 14.599% 11.842% 10.499% 11.873% 5.527% 18.103% 10.447% 5.504% 5.283% 7.836%
Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Xu et al.

ranking. We observe that the performance of both models improves


with an increase in labeled data size, and the pre-trained model con-
Path 2
sistently outperforms the no pre-trained model. This illustrates that Path 1

the pre-trained model, equipped with extensive cross-modal context


information, requires less labeled data and achieves superior perfor-
mance compared to the model without pre-training. These findings
suggest that MM-Path can effectively serve as a pre-training model
to enhance supervised learning methods.

4.2.4 Parameter Sensitivity. We explore the impact of image


Figure 7: Visualization of two paths
granularity size on the model’s performance. For uniform segmenta-
tion, each 500 × 500 pixel image is segmented into 1×1, 2×2, 4×4,
5×5, and 10×10 patches, respectively. The performance and testing
runtime for each granularity size are detailed in Figure 6. Table 7: Travel time estimation results of different models
We observe that a larger number of patches also incurs longer
inference runtime for each path. Moreover, the model’s performance Ground Travel Time Estimation
improves as the number of patches increases from 1×1 to 4×4, Case
truth MM-Path TrajCL START JGRM START+image
suggesting that finer granularity enhances the capture of detailed fea- Path 1 17.0 22.7 30.0 32.9 26.9 29.2
tures. However, performance begins to decline with further increases Path 2 36.0 40.1 17.1 24.3 73.8 26.5
to 5×5 and 10×10 patches. This decrease is due to the limited context
features extracted by excessively fine granularity, which negatively
impacts path understanding.
Then, we explore the model scalability in terms of road path
length. We evaluate the performance of all models on paths with 5 Related work
varying numbers of nodes. Specifically, paths with fewer than 50 5.1 Path Representation Learning Models
nodes are classified as short paths, while those with more than 50
With the advancement of location-based services, the trajectory gen-
nodes are classified as long paths. The performance of all models on
eration [51, 52] and trajectory analysis [6, 10, 13, 27, 28] have
short and long paths is detailed in Tables 5 and 6, respectively. The
become increasingly prevalent. To better understand static represen-
experimental results show that MM-Path outperforms other models
tation of trajectories, recent studies have increasingly focused on
across both path lengths, demonstrating its superiority.
developing path representation learning models that do not rely on
labeled training data, showing robust generalization across multiple
downstream tasks [5, 21, 29, 45]. For instance, Jiang et al. [21] intro-
4.2.5 Case Study. We inspect a pair of representative paths in duce a self-supervised trajectory representation learning framework
Aalborg to demonstrate the superiority of MM-Path. The road paths that includes tasks such as span-masked trajectory recovery and tra-
and image paths are visualized in Figure 7. The travel time estimation jectory contrastive learning to leverage temporal patterns and travel
results of MM-Path and the superior baselines are shown in Table 7. semantics effectively. Yang et al. [45] aim to minimize resource con-
Two paths in Figure 7 exhibit a similar structure on the road sumption and enhance model scalability by developing LightPath, a
network, both having a node degree sequence of ⟨3, 2, 3, 3, 3⟩. Such lightweight and scalable framework designed to conserve resources
single-modal data might suggest that these paths have comparable while maintaining accuracy. Additionally, Ma et al. [29] propose
travel times. However, the visual information from their images a representation learning framework that integrates GPS and route
differs significantly. Specifically, path 1 traverses a roundabout and modeling based on self-supervised technology, further expanding
runs along a trunk road, where paths typically allow for faster travel the field’s methodologies.
speeds. In contrast, path 2 is located on an ordinary road near resi- Recent advancements in Large Language Models (LLMs) [38,
dential buildings, typically associated with slower speeds. 39] have facilitated the development of general spatio-temporal
As shown in Table 7, the travel time estimates from TrajCL, prediction models [23–25, 48, 50]. For example, Unist [48] maps
START, and START+image suggest a shorter travel time for path 2 spatio-temporal data onto grids and provides generic predictions
and a longer travel time for path 1, which are contrary to the ground across various scenarios using elaborated masking strategies and
truth. This illustrates that a single modality provides limited infor- spatio-temporal knowledge-guided prompts. However, since each
mation, and the simple concatenation of multi-modal data in the path exhibits geographical continuity and cannot be discretized into
START+image model fails to effectively extract image information. a single region, these models are not well-suited for path modeling.
In contrast, the results from the multi-modal model JGRM align with Despite recent advancements, current path representation learning
the relative magnitudes of the actual travel times, but its estimates models overlook the potential contributions of images in path un-
exhibit large deviations. Meanwhile, MM-Path demonstrates supe- derstanding, which can provide valuable insights into the geometric
rior travel time estimation performance compared to other models, features and contextual environmental information from a global
indicating its effective fusion and utilization of image information. perspective.
MM-Path: Multi-modal, Multi-granularity Path Representation Learning—Extended Version Conference acronym ’XX, June 03–05, 2018, Woodstock, NY

5.2 Multi-modal Pre-trained Models [7] Lu Chen, Qilu Zhong, Xiaokui Xiao, Yunjun Gao, Pengfei Jin, and Christian S
Jensen. 2018. Price-and-time-aware dynamic ridesharing. In ICDE. IEEE, 1061–
Multi-modal pre-trained models aim to enhance target representa- 1072.
tion by leveraging diverse modalities, including text, images, and [8] Meng Chen, Zechen Li, Weiming Huang, Yongshun Gong, and Yilong Yin. 2024.
Profiling urban streets: A semi-supervised prediction model based on street view
audio. Some methodologies converge the information from various imagery and spatial topology. In KDD. 319–328.
modalities into a unified latent space [37, 41]. For example, Wang et [9] Peng Chen, Yingying Zhang, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong
al. [41] propose an unsupervised multi-modal method that encodes Wen, Bin Yang, and Chenjuan Guo. 2024. Pathformer: Multi-scale transformers
with adaptive pathways for time series forecasting. In ICLR.
visual, textual, and geospatial data of urban neighborhoods into a [10] Yile Chen, Gao Cong, and Cuauhtemoc Anda. 2023. Teri: An effective framework
unified vector space using multiple triplet loss functions. Radford for trajectory recovery with irregular time intervals. PVLDB 17, 3 (2023), 414–
et al. [37] develop CLIP, a large-scale multi-modal model that eval- 426.
[11] Yile Chen, Xiucheng Li, Gao Cong, Zhifeng Bao, Cheng Long, Yiding Liu,
uates the congruence of image-text pairs through cosine similarity. Arun Kumar Chandran, and Richard Ellison. 2021. Robust road network rep-
Some research opts to map each modality into distinct embedding resentation learning: When traffic patterns meet traveling semantics. In CIKM.
211–220.
spaces, while enforcing fusion on the representations [2, 36, 37]. For [12] Jian Dai, Bin Yang, Chenjuan Guo, and Zhiming Ding. 2015. Personalized route
instance, Pramanick et al. [36] achieve robust video-text represen- recommendation using big trajectory data. In ICDE. IEEE, 543–554.
tation by embedding cross-modal fusion within the core video and [13] Xin Ding, Lu Chen, Yunjun Gao, Christian S Jensen, and Hujun Bao. 2018.
UlTraMan: A unified platform for big trajectory data management and analytics.
language structures, thereby facilitating various downstream tasks PVLDB 11, 7 (2018), 787–799.
and reducing fine-tuning requirements. Bao et al. [2] present VLMO, [14] Ahmed Elbery, Hossam S Hassanein, Nizar Zorba, and Hesham A Rakha. 2020.
a unified vision-language pre-trained model that jointly trains a Iot-based crowd management framework for departure control and navigation.
IEEE Trans. Veh. Technol. 70, 1 (2020), 95–106.
dual encoder and a fusion encoder within a modular Transformer [15] Noel Gorelick, Matt Hancher, Mike Dixon, Simon Ilyushchenko, David Thau, and
network. Rebecca Moore. 2017. Google Earth Engine: Planetary-scale geospatial analysis
for everyone. Remote Sens. Environ. (2017).
Nevertheless, these models are predominantly trained on general [16] Aditya Grover and Jure Leskovec. 2016. Node2vec: Salable feature learning for
corpora. The characteristics of road paths and image paths, which networks. In KDD. 855–864.
include complex correspondences and spatial topological relation- [17] Chenjuan Guo, Christian S. Jensen, and Bin Yang. 2014. Towards total traffic
awareness. SIGMOD Record 43, 3 (2014), 18–23.
ships, are distinct from those found in conventional multi-modal [18] Chenjuan Guo, Ronghui Xu, Bin Yang, Ye Yuan, Tung Kieu, Yan Zhao, and
datasets. Consequently, existing multi-modal models exhibit limited Christian S Jensen. 2024. Efficient stochastic routing in path-centric uncertain
generalizability to this specialized domain. road networks. PVLDB 17, 11 (2024), 2893–2905.
[19] Chenjuan Guo, Bin Yang, Jilin Hu, and Christian Jensen. 2018. Learning to route
with sparse trajectory sets. In ICDE. IEEE, 1073–1084.
6 Conclusion and future work [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual
learning for image recognition. In CVPR. 770–778.
In this paper, we propose a Multi-modal Multi-granularity Path Rep- [21] Jiawei Jiang, Dayan Pan, Houxing Ren, Xiaohan Jiang, Chao Li, and Jingyuan
resentation Learning Framework (MM-Path), which is the first work Wang. 2023. Self-supervised trajectory representation learning with temporal
regularities and travel semantics. In ICDE. IEEE.
that integrate data from road networks and remote sensing images [22] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT:
into generic path representation learning. Initially, we model the road Pre-training of deep bidirectional transformers for language understanding. In
paths and image paths separately, and implement a multi-granularity NAACL-HLT. 4171–4186.
[23] Duc Kieu, Tung Kieu, Peng Han, Bin Yang, Christian S. Jensen, and Bac Le. 2024.
alignment strategy to ensure the synchronization of both detailed TEAM: Topological evolution-aware framework for traffic forecasting. PVLDB
local information and broader global context. Furthermore, we de- 18 (2024).
velop a graph-based cross-modal residual fusion component that [24] Zhonghang Li, Lianghao Xia, Jiabin Tang, Yong Xu, Lei Shi, Long Xia, Dawei Yin,
and Chao Huang. 2024. UrbanGPT: Spatio-Temporal Large Language Models.
effectively fuses information from both modalities while preserving arXiv:2403.00813 [cs.CL]
the semantic consistency between modalities. MM-Path outperforms [25] Zhonghang Li, Lianghao Xia, Yong Xu, and Chao Huang. 2024. FlashST: A simple
and universal prompt-tuning framework for traffic prediction. arXiv preprint
all baselines on two real-world datasets across two downstream tasks, arXiv:2405.17898 (2024).
demonstrating its superiority. [26] Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2024. Foundations &
In the future, we plan to further investigate the capability of multi- trends in multimodal machine learning: Principles, challenges, and open questions.
ACM Comput. Surv. 56, 10 (2024), 1–42.
modal models for generic path representing learning, with particular [27] Ziqiao Liu, Hao Miao, Yan Zhao, Chenxi Liu, Kai Zheng, and Huan Li. 2024.
focus on few-shot and zero-shot learning scenarios. LightTR: A lightweight framework for federated trajectory recovery. arXiv
preprint arXiv:2405.03409 (2024).
[28] Yandi Lun, Hao Miao, Jiaxing Shen, Renzhi Wang, Xiang Wang, and Senzhang
References Wang. 2024. Resisting tul attack: balancing data privacy and utility on trajectory
[1] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2021. BEiT: BERT pre- via collaborative adversarial learning. GeoInformatica 28, 3 (2024), 381–401.
training of image Transformers. In ICLR. [29] Zhipeng Ma, Zheyan Tu, Xinhai Chen, Yan Zhang, Deguo Xia, Guyue Zhou, Yilun
[2] Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Chen, Yu Zheng, and Jiangtao Gong. 2024. More than routing: Joint GPS and
Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. 2022. Vlmo: Uni- route modeling for refine trajectory representation learning. In WWW. 3064–3075.
fied vision-language pre-training with mixture-of-modality-experts. NeurIPS 35 [30] Sachit Menon and Carl Vondrick. 2022. Visual classification via description from
(2022), 32897–32912. large language models. In ICLR.
[3] Sean Bin Yang, Chenjuan Guo, Jilin Hu, Bin Yang, Jian Tang, and Christian S. [31] Paul Newson and John Krumm. 2009. Hidden markov map matching through
Jensen. 2022. Weakly-supervised temporal path representation learning with noise and sparseness. In SIGSPATIAL. 336–343.
contrastive curriculum learning. In ICDE. 2873–2885. [32] Zhicheng Pan, Yihang Wang, Yingying Zhang, Sean Bin Yang, Yunyao Cheng,
[4] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. Grarep: Learning graph repre- Peng Chen, Chenjuan Guo, Qingsong Wen, Xiduo Tian, Yunliang Dou, et al. 2023.
sentations with global structural information. In CIKM. 891–900. Magicscaler: Uncertainty-aware, predictive autoscaling. PVLDB 16, 12 (2023),
[5] Yanchuan Chang, Jianzhong Qi, Yuxuan Liang, and Egemen Tanin. 2023. Con- 3808–3821.
trastive trajectory similarity learning with dual-feature attention. In ICDE. IEEE, [33] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gre-
2933–2945. gory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al.
[6] Lu Chen, Yunjun Gao, Ziquan Fang, Xiaoye Miao, Christian S Jensen, and Chen- 2019. Pytorch: An imperative style, high-performance deep learning library.
juan Guo. 2019. Real-time distributed co-movement pattern detection on streaming [34] Simon Aagaard Pedersen, Bin Yang, and Christian S. Jensen. 2020. Anytime
trajectories. PVLDB 12, 10 (2019), 1208–1220. stochastic routing with hybrid learning. PVLDB 13, 9 (2020), 1555–1567.
Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Xu et al.

[35] Simon Aagaard Pedersen, Bin Yang, Christian S. Jensen, and Jesper Møller. 2023. [43] Bin Yang, Jian Dai, Chenjuan Guo, Christian S Jensen, and Jilin Hu. 2018. PACE:
Stochastic routing with arrival windows. ACM Trans. Spatial Algorithms Syst. 9, a PAth-CEntric paradigm for stochastic path finding. VLDB J. 27 (2018), 153–178.
4 (2023), 30:1–30:48. [44] Sean Bin Yang, Chenjuan Guo, and Bin Yang. 2020. Context-aware path ranking
[36] Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, in road networks. IEEE Trans. Knowl. Data Eng. 34, 7 (2020), 3153–3168.
Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. 2023. EgoVLPv2: [45] Sean Bin Yang, Jilin Hu, Chenjuan Guo, Bin Yang, and Christian S. Jensen. 2023.
Egocentric video-language pre-training with fusion in the backbone. In ICCV. LightPath: Lightweight and scalable path representation learning. In KDD. ACM,
5285–5297. 2999–3010.
[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, [46] Sean Bin Yang and Bin Yang. 2020. Learning to rank paths in spatial networks. In
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, ICDE. IEEE, 2006–2009.
et al. 2021. Learning transferable visual models from natural language supervision. [47] Ziyi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu Shi,
In ICML. 8748–8763. Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, et al. 2023. i-code: An integrative
[38] Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, and composable multimodal learning framework. In AAAI, Vol. 37. 10880–10890.
Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. 2022. UL2: [48] Yuan Yuan, Jingtao Ding, Jie Feng, Depeng Jin, and Yong Li. 2024. UniST:
Unifying language learning paradigms. In ICLR. A prompt-empowered universal model for urban spatio-temporal prediction. In
[39] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kul- KDD.
shreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. [49] Sen Zhang, Senzhang Wang, Xiang Wang, Shigeng Zhang, Hao Miao, and Junxing
2022. Lamda: Language models for dialog applications. arXiv preprint Zhu. 2022. Multi-task adversarial learning for semi-supervised trajectory-user
arXiv:2201.08239 (2022). linking. In ECML PKDD. Springer, 418–434.
[40] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, [50] Kai Zhao, Chenjuan Guo, Yunyao Cheng, Peng Han, Miao Zhang, and Bin Yang.
Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. 2023. Multiple time series forecasting with dynamic graph modeling. PVLDB 17,
2023. Image as a foreign language: Beit pretraining for vision and vision-language 4 (2023), 753–765.
tasks. In CVPR. 19175–19186. [51] Yuanshao Zhu, Yongchao Ye, Shiyao Zhang, Xiangyu Zhao, and James Yu. 2023.
[41] Zhecheng Wang, Haoyuan Li, and Ram Rajagopal. 2020. Urban2vec: Incorporat- Difftraj: Generating gps trajectory with diffusion probabilistic model. NeurIPS 36
ing street view imagery and pois for multi-modal urban neighborhood embedding. (2023), 65168–65188.
In AAAI, Vol. 34. 1013–1020. [52] Yuanshao Zhu, James Jianqiao Yu, Xiangyu Zhao, Qidong Liu, Yongchao Ye,
[42] Ronghui Xu, Weiming Huang, Jun Zhao, Meng Chen, and Liqiang Nie. 2023. A Wei Chen, Zijian Zhang, Xuetao Wei, and Yuxuan Liang. 2024. Controltraj:
spatial and adversarial representation learning approach for land use classification Controllable trajectory generation with topology-constrained diffusion model. In
with POIs. ACM Trans. Intell. Syst. Technol. 14, 6, Article 114 (nov 2023), KDD. 4676–4687.
25 pages.
Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009

You might also like