MM-Path: Multi-Modal, Multi-Granularity Path Representation Learning - Extended Version
Ronghui Xu1, Hanyin Cheng1, Chenjuan Guo1, Hongfan Gao1, Jilin Hu1, Sean Bin Yang2, Bin Yang1*
1 East China Normal University, 2 Chongqing University of Posts and Telecommunications
{rhxu,hycheng,hf.gao}@stu.ecnu.edu.cn,{cjguo,jlhu,byang}@dase.ecnu.edu.cn,[email protected]
Abstract
Developing effective path representations has become increasingly essential across various fields within intelligent transportation.

[Figure 1: (a) a road path over nodes v2-v8; (b), (c) the corresponding remote sensing images.]
comprehensive context essential for a complete understanding of paths. This calls for developing a multi-modal pre-trained path representation learning model. Nonetheless, constructing such a model faces several challenges:

Information granularity discrepancies between road paths and image paths significantly hinder cross-modal semantic alignment. Effective cross-modal alignment, which ensures semantic consistency and complementarity among various modalities, is crucial for constructing multi-modal models [26]. However, the discrepancies in information granularity between road paths and image paths are substantial. As depicted in Figure 1, road paths typically focus on detailed topological structures and delineate road connectivity, while image paths capture global environmental contexts on a large scale, reflecting the functional attributes of the corresponding regions. It is worth noting that images may include extensive regions that have low relevance to the road paths, such as the dark regions in Figure 1 (c). Current image-text multi-modal models [2, 30, 37, 40] typically align individual images with textual sequences. However, such single-granularity, coarse alignment methods introduce noise and are not suitable for the precise alignment required for paths. Additionally, as shown in Figure 1 (a), roads have different granularities in nature, including intersections, road segments, and sub-roads. Fully understanding paths at different granularities can provide insights from the micro to the macro level, mitigating the negative effects caused by the differences in information granularity across modalities. Although some studies [9, 32] have explored multi-granularity in single-modal data, they have not adequately addressed the requirements of multi-granularity analysis in multi-modal contexts. Thus, it is crucial to refine multi-granularity data processing and develop multi-granularity methods for cross-modal alignment.

The inherent heterogeneity of road paths and image paths poses a significant challenge during feature fusion. The differences in data structure and information granularity between road paths and image paths extend to their learning methods. Road path representation learning typically focuses on connectivity and reachability between roads and intersections, as well as on analyzing graph structures [3, 21, 44, 45]. Conversely, image learning methods that are able to learn image paths prioritize object recognition and feature extraction, aiming for a broad understanding of image content [1, 20]. These disparate learning methods lead to road paths and image paths being mapped to different embedding spaces, so that feature dimensions with similar semantics contain entirely different information. Simple fusion methods such as early fusion (i.e., integrating multiple modalities before or during the feature extraction stage) and late fusion (i.e., keeping each modality independently processed until the final fusion stage) may result in information loss and increased bias, and they fail to capture subtle correlations between road paths and image paths [26, 47]. Therefore, a multi-modal fusion method that can capture the relationships among entities in different modalities and ensure effective data fusion is critically needed.

To address these challenges, we propose a Multi-modal, Multi-granularity Path Representation Learning Framework, namely MM-Path, for learning generic path representations.

To address the first challenge, we propose a multi-granularity alignment component. This component systematically associates intersections, road sub-paths, and entire road paths with their corresponding image information to capture details accurately at a finer granularity while maintaining global correspondence at a coarser granularity. Specifically, we divide the image of the entire region of interest into small fixed-size images, collect the fixed-size images along each path, and arrange the collected images into an image path (i.e., an image sequence). We employ modal-specific tokenizers to generate the initial embeddings for road paths and image paths, respectively. Subsequently, these initial embeddings are fed into a powerful Transformer architecture to learn complex encoded embeddings for each modality at three granularities. Finally, a multi-granularity alignment loss function is employed to ensure the alignment of the road and image encoded embeddings across different granularities.
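For illustration, a minimal sketch of this image-path construction could look as follows (this is not the MM-Path implementation; the coordinate convention, the tile size, and all helper names are assumptions): each node of a road path is mapped to the fixed-size image that contains it, and the images are collected in traversal order.

```python
from typing import List, Tuple

# Assumed convention: the region of interest is pre-cut into a grid of
# fixed-size images (tiles), each covering TILE_SIZE_M x TILE_SIZE_M meters,
# and node coordinates are given in meters relative to the region's origin.
TILE_SIZE_M = 1000.0

def tile_of(x_m: float, y_m: float) -> Tuple[int, int]:
    # Index (row, col) of the fixed-size image that contains the point.
    return int(y_m // TILE_SIZE_M), int(x_m // TILE_SIZE_M)

def build_image_path(node_coords: List[Tuple[float, float]]) -> List[Tuple[int, int]]:
    # Collect the fixed-size images crossed by a road path (an ordered node
    # sequence) and arrange them into an image path, dropping consecutive
    # duplicates so each image appears once per visit.
    image_path: List[Tuple[int, int]] = []
    for x_m, y_m in node_coords:
        tile = tile_of(x_m, y_m)
        if not image_path or image_path[-1] != tile:
            image_path.append(tile)
    return image_path

# A short path whose last node falls into the neighbouring image.
print(build_image_path([(120.0, 80.0), (450.0, 300.0), (1040.0, 320.0)]))
# -> [(0, 0), (0, 1)]
```

Dropping consecutive duplicates keeps one image per visited region while preserving the order in which the path crosses the grid.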
To address the second challenge, we introduce a graph-based cross-modal residual fusion component, which is designed to effectively fuse cross-modal features while incorporating spatial contextual information. Specifically, we link the encoded embeddings of each modality with the initial embeddings of the other modality to create road and image residual embeddings, respectively, with the purpose of fusing cross-modal features from different stages. We then build a cross-modal adjacency matrix for each path based on spatial correspondences and contextual information. This matrix guides a GCN to iteratively fuse the residual embeddings of each modality separately, thus obtaining road and image fused embeddings. Finally, we apply a contrastive loss to ensure the consistency of the fused embeddings across the two modalities. As the final representation effectively integrates cross-stage features of the two modalities with spatial context information, this component not only achieves deep multi-modal fusion but also enhances the comprehensive utilization of information.

The contributions of this work are delineated as follows:
• We propose a Multi-modal, Multi-granularity Path Representation Learning Framework that learns generic path representations applicable to various downstream tasks. To the best of our knowledge, MM-Path is the first model that leverages road network data and remote sensing images to learn generic path representations.
• We model and align the multi-modal path information using a fine-to-coarse multi-granularity alignment strategy. This strategy effectively captures both intricate local details and the broader global context of the path.
• We introduce a graph-based cross-modal residual fusion component. This component utilizes a cross-modal GCN to fully integrate information from different modalities while maintaining the consistency of the two modalities.
• We conduct extensive experiments on diverse tasks using two real-world datasets to demonstrate the adaptability and superiority of our model.

2 Preliminaries
2.1 Basic Concepts
Path. A path p is a sequence of continuous junctions, which can be observed from the road network view and the image view.
Road network. A road network is denoted as G = (V, E), where V and E represent the sets of nodes and edges, respectively.
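As a minimal illustration of these two definitions (a sketch only; the container types and the helper name are hypothetical), a road network can be held as a node set plus an edge set, and a candidate path validated as a sequence of consecutively connected junctions:

```python
from typing import List, Set, Tuple

# A road network G = (V, E): a set of node ids and a set of directed edges.
RoadNetwork = Tuple[Set[int], Set[Tuple[int, int]]]

def is_valid_path(path: List[int], network: RoadNetwork) -> bool:
    # A path is a sequence of continuous junctions: every node must belong to
    # V, and every consecutive pair must be connected by an edge in E.
    nodes, edges = network
    if any(v not in nodes for v in path):
        return False
    return all((u, v) in edges for u, v in zip(path, path[1:]))

G: RoadNetwork = ({1, 2, 3, 4}, {(1, 2), (2, 3), (3, 4)})
print(is_valid_path([1, 2, 3], G))  # True
print(is_valid_path([1, 3], G))     # False: nodes 1 and 3 are not adjacent
```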
[Figure: Overview of the MM-Path framework - an image path encoding branch (patch tokenizer, image and patch tokens with [sep]/[cls], Image-Transformer) and a road path encoding branch (node tokenizer with random masking, road sub-path and road path tokens with [sep]/[cls], Road-Transformer), multi-granularity embeddings with average pooling, a multi-layer GCN, and a contrastive loss.]
reduced. Then, we utilize specialized tokenizers to separately encode the data of each modality into a unified format.

The patch tokenizer segments each image within an image path into a series of patches to extract fine-grained semantic information. Specifically, as shown in Figure 2, an image m_i ∈ R^{r×r×c} is reshaped into a sequence of r²/o² (e.g., 16) patches, where c represents the number of channels, (r, r) denotes the resolution of a fixed-size image, and (o, o) defines the resolution per patch. After patching, we concatenate the patch sequences from all images within an image path to form a unified patch sequence. Then, we place a special [cls] token at the beginning of the patch sequence. As the [cls] token captures the global information of the entire sequence [1], it can be regarded as a representation of the entire image path. Special [sep] tokens are placed at the end of each image to delineate the local information of each image. For example, the token sequence of the image path M(p) = ⟨m_1, m_2, m_3⟩ is [cls, m_1^(1), ..., m_1^(16), sep, ..., sep, m_3^(1), ..., m_3^(16), sep], where m_i^(j) ∈ R^{o×o×c} denotes the j-th patch of the i-th image.

Each patch m_i^(j) is then projected into a patch embedding m_i^(j) ∈ R^d, which can be initialized using a pre-trained ResNet50 [20]. The image initial embeddings are computed by summing the patch embeddings with the image position embeddings T_image ∈ R^{n1×d}, resulting in H^(0) = [m_cls, m_1^(1), ..., m_{|M(p)|}^{(r²/o²)}, m_sep] + T_image. Here, m_cls and m_sep are the image initial embeddings of the [cls] and [sep] tokens, respectively, n1 denotes the length of the patch token sequence, and d represents the dimension of the embeddings.
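A short sketch of this patch tokenization (illustrative only; the 125-pixel patch resolution follows from the 500 x 500 images and 16 patches per image reported later, and the string placeholders for [cls]/[sep] are assumptions):

```python
import numpy as np

def image_to_patches(image: np.ndarray, o: int) -> np.ndarray:
    # Reshape one fixed-size image of shape (r, r, c) into (r*r)/(o*o) patches
    # of shape (o, o, c), in row-major order.
    r, _, c = image.shape
    patches = image.reshape(r // o, o, r // o, o, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, o, o, c)

def tokenize_image_path(images: list, o: int) -> list:
    # Build [cls, m1 patches..., sep, m2 patches..., sep, ...]: the [cls] token
    # summarizes the whole image path, and one [sep] token closes each image.
    tokens = ["cls"]
    for image in images:
        tokens.extend(image_to_patches(image, o))
        tokens.append("sep")
    return tokens

# An image path of three 500x500 RGB images, each split into 16 patches of 125x125.
image_path = [np.zeros((500, 500, 3)) for _ in range(3)]
print(len(tokenize_image_path(image_path, o=125)))  # 1 + 3 * (16 + 1) = 52
```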
The modeling for a road path is similar, starting with a [cls] token at the beginning and placing [sep] tokens at the end of each road sub-path. For instance, a road path R(p) comprising three road sub-paths, s_1 = ⟨v_1, v_2, v_3⟩, s_2 = ⟨v_4, v_5, v_6⟩, and s_3 = ⟨v_7, v_8⟩, generates the node token sequence [cls, v_1, v_2, v_3, sep, v_4, v_5, v_6, sep, v_7, v_8, sep]. Then, the node tokenizer linearly projects each node v_i from the road path R(p) to a node embedding v_i ∈ R^d, initialized using Node2vec [16]. We also integrate standard learnable road position embeddings T_road ∈ R^{n2×d} with the node embeddings to generate the road initial embeddings P^(0) = [v_cls, v_1, ..., v_{|R(p)|}, v_sep] + T_road, where v_cls and v_sep represent the road initial embeddings of the [cls] and [sep] tokens, respectively, and n2 denotes the length of the road path token sequence.

3.2.2 Image Path Encoding. To model the image path, the image initial embeddings H^(0) are fed into the l-layer Image-Transformer, which can be formulated as

H^(j) = Image-Transformer(H^(j−1)),  (2)

where j = 1, ..., l, and H^(l) ∈ R^{n1×d} is the final output of the Image-Transformer. For simplicity, we denote the image encoded embeddings H^(l) as H = [h_cls, h_1^(1), ..., h_{|M(p)|}^{(r²/o²)}, h_sep_{|M(p)|}]. Here, h_i^(j) ∈ R^d denotes the encoded embedding of the j-th patch of the i-th image, and h_sep_i, h_cls ∈ R^d represent the encoded embeddings of the i-th image and the whole image path M(p), respectively.

3.2.3 Road Path Encoding. To facilitate alignment between road paths and image paths, we utilize a similar Transformer architecture for road path modeling. The encoder is comprised of l layers of Transformer blocks, defined as

P^(j) = Road-Transformer(P^(j−1)),  (3)

where j = 1, ..., l, and P^(j) ∈ R^{n2×d} is the output of the j-th layer. In brief, the road encoded embeddings P^(l) are represented as P = [p_cls, p_1, ..., p_{|R(p)|}, p_sep_{|M(p)|}]. Here, p_i, p_sep_i, and p_cls ∈ R^d denote the encoded embeddings of the i-th node, the sub-path s_i, and the entire road path R(p), respectively. |R(p)| represents the number of nodes in road path R(p), and |M(p)| indicates the number of images in image path M(p), which also corresponds to the number of road sub-paths.
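Equations (2) and (3) simply stack l Transformer layers over the initial embeddings of each branch. A minimal PyTorch sketch (not the MM-Path code; the number of attention heads and the token counts are assumptions, while d = 64 and l = 5 are the values reported in Section 4.1.2):

```python
import torch
import torch.nn as nn

d, l = 64, 5  # embedding dimension and number of layers (cf. Section 4.1.2)

def make_encoder() -> nn.TransformerEncoder:
    # One branch encoder: l identical Transformer blocks applied to the
    # initial token embeddings, as in Eqs. (2) and (3).
    layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=l)

image_transformer = make_encoder()   # H^(j) = Image-Transformer(H^(j-1))
road_transformer = make_encoder()    # P^(j) = Road-Transformer(P^(j-1))

H0 = torch.randn(1, 52, d)           # image initial embeddings (n1 = 52 tokens)
P0 = torch.randn(1, 12, d)           # road initial embeddings (n2 = 12 tokens)
H, P = image_transformer(H0), road_transformer(P0)   # encoded embeddings
print(H.shape, P.shape)              # (1, 52, 64) and (1, 12, 64)
```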
To better capture the complex dependencies in a path, similar to the masked language modeling (MLM) task [22], we use a masked node modeling task as a self-supervised task. The intuition behind this is that the information density of individual pixels or patches in an image is relatively low compared to the topological structure information relevant to the path. Therefore, image masking tasks are not deemed essential. To this end, we propose to employ node masking tasks. In particular, we randomly mask the nodes in road paths (shown as the gray triangle in Figure 3), and then use a softmax classifier to predict the node tokens corresponding to the masked nodes. The loss function for training is defined as follows:

L_mask = − Σ_{p∈P} Σ_{i∈D} log P(v_i | v_i^mask),  (4)

where P represents the training set of all paths, D is the set of randomly masked positions of a road path, and v_i^mask is the node that is masked according to D.
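A compact sketch of this masked node modeling objective (illustrative only; the 15% mask ratio is taken from Section 4.1.2, while the classifier and the handling of the mask token are simplified assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_nodes, d, mask_ratio = 7561, 64, 0.15   # mask ratio from Section 4.1.2
mask_token_id = num_nodes                   # an extra id reserved for [mask]
node_classifier = nn.Linear(d, num_nodes)   # softmax classifier over node ids

def masked_node_loss(node_ids: torch.Tensor, encode) -> torch.Tensor:
    # node_ids: (batch, seq_len) node tokens of road paths.
    # encode: maps corrupted token ids to encoded embeddings (batch, seq_len, d).
    masked = torch.rand_like(node_ids, dtype=torch.float) < mask_ratio
    corrupted = node_ids.masked_fill(masked, mask_token_id)
    logits = node_classifier(encode(corrupted))        # (batch, seq_len, num_nodes)
    # Cross entropy only over the randomly masked positions D, as in Eq. (4).
    return F.cross_entropy(logits[masked], node_ids[masked])

# Toy usage with a random stand-in for the Road-Transformer encoder.
paths = torch.randint(0, num_nodes, (2, 12))
print(float(masked_node_loss(paths, lambda ids: torch.randn(*ids.shape, d))))
```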
3.2.4 Modalities Aligning. The encoded embeddings from each branch capture the hidden semantic information within their respective modality, including fine-grained node/patch embeddings, medium-grained road sub-path/image embeddings, and coarse-grained entire road path/image path embeddings. We aim for embeddings with similar semantics across modalities to be proximate within the embedding space. Additionally, we seek detailed alignment between the two modalities while maintaining global correspondence. Accordingly, we design a loss function that operates at three distinct levels of granularity (fine, medium, and coarse), corresponding to node/patch, road sub-path/image, and entire road path/image path, respectively.

Fine granularity. Since each patch may contain more than one node, the encoded embeddings of a node and the corresponding patch (shown as the dark yellow triangle and the dark green rectangle with yellow borders in Figure 3) should maintain directional consistency. To precisely capture the semantic information of fine-grained paths, we minimize the cosine distance between the encoded embeddings of nodes and their corresponding patches. Consequently, we construct the fine-grained loss function as follows:

L_fine = Σ_{p∈P} Σ_{v_i∈R(p), L(v_i)=m_j^(k)} ( 1 − (p_i · h_j^(k)) / (∥p_i∥ ∥h_j^(k)∥) ),  (5)

where P represents the training set of paths, L(v_i) is a function that returns the patch corresponding to the node v_i, and p_i and h_j^(k) are the encoded embeddings of v_i and m_j^(k), respectively.

Medium granularity. Similarly, to align road sub-paths with images, we construct the following medium-grained loss function:

L_medium = Σ_{p∈P} Σ_{s_i∈R(p)} ( 1 − (p_sep_i · h_sep_i) / (∥p_sep_i∥ ∥h_sep_i∥) ),  (6)

where p_sep_i and h_sep_i (shown as the dark yellow rectangle and the dark green triangle with blue borders in Figure 3) are the encoded embeddings of road sub-path s_i and the corresponding image m_i, respectively.
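Both Eq. (5) and Eq. (6) are cosine-distance penalties between paired embeddings from the two branches. A minimal sketch (illustrative only; it assumes the node-to-patch mapping L(·) has already been resolved into matched index pairs):

```python
import torch
import torch.nn.functional as F

def cosine_alignment_loss(road_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    # road_emb, image_emb: (num_pairs, d) embeddings of matched pairs, e.g.
    # node/patch pairs for Eq. (5) or sub-path/image [sep] pairs for Eq. (6).
    # Each pair contributes 1 - cos(road, image), summed over all pairs.
    return (1.0 - F.cosine_similarity(road_emb, image_emb, dim=-1)).sum()

p = torch.randn(5, 64)   # encoded node (or sub-path) embeddings
h = torch.randn(5, 64)   # encoded embeddings of the corresponding patches (or images)
print(float(cosine_alignment_loss(p, h)))
```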
Coarse granularity. Due to the unique correspondence between a road path and its corresponding image path, a clearer distinction is necessary. Therefore, we construct a contrastive loss function for coarse-grained data. Considering a batch of road path-image path pairs B, the objective of this contrastive learning loss is to accurately identify the matched pairs among the |B| × |B| possible combinations. Within a training batch, there are |B|² − |B| negative pairs. The contrastive learning loss function can be formulated as:

L_coarse = − Σ_{p∈P} ( log( exp(sim(p_cls, h_cls)/σ) / Σ_{m^Neg∈B} exp(sim(p_cls, h_cls^Neg)/σ) ) + log( exp(sim(p_cls, h_cls)/σ) / Σ_{p^Neg∈B} exp(sim(p_cls^Neg, h_cls)/σ) ) ),  (7)

where p_cls and h_cls (shown as the dark yellow rectangle and the dark green triangle with dark blue borders in Figure 3) correspond to the encoded embeddings of the entire road path and the entire image path in a road path-image path pair, p^Neg and m^Neg are the negative road paths and image paths in the batch B, respectively, σ is a learned temperature parameter, and sim(p_cls, h_cls) returns the Euclidean distance between p_cls and h_cls.

Finally, the multi-granularity loss can be formulated as L_multi = L_fine + L_medium + L_coarse.
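Equation (7) is a symmetric, batch-level contrastive loss over matched road path-image path pairs. The sketch below (illustrative only) uses the negative Euclidean distance as the similarity score so that matched pairs receive the highest logits; the paper defines sim(·,·) as the Euclidean distance and learns the temperature σ.

```python
import torch
import torch.nn.functional as F

def coarse_contrastive_loss(p_cls: torch.Tensor, h_cls: torch.Tensor,
                            sigma: torch.Tensor) -> torch.Tensor:
    # p_cls, h_cls: (|B|, d) whole-path [cls] embeddings from the road and
    # image branches; row i of each forms the positive pair, and all other
    # rows in the batch act as negatives, as in Eq. (7).
    logits = -torch.cdist(p_cls, h_cls) / sigma      # negative distance / temperature
    targets = torch.arange(p_cls.size(0))
    # Road-to-image direction plus image-to-road direction.
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)

sigma = torch.tensor(0.1)
print(float(coarse_contrastive_loss(torch.randn(8, 64), torch.randn(8, 64), sigma)))
```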
3.3 Graph-based Cross-modal Residual Fusion
In this framework, each path is represented through two distinct modalities, providing complementary perspectives. To effectively leverage these modalities, we propose a graph-based cross-modal residual fusion component.

3.3.1 Cross-modal Residual Connection. To facilitate comprehensive information exchange between modalities, we introduce cross-modal residual connections that effectively concatenate embeddings across different stages and modalities. These connections enable direct propagation of gradients to earlier layers, thereby enhancing stability and improving training efficiency. Specifically, we concatenate the road initial embeddings P^(0) with the image encoded embeddings H, and the image initial embeddings H^(0) with the road encoded embeddings P. The resulting image residual embeddings and road residual embeddings are defined as U = P^(0)∥H and Q = P∥H^(0), respectively. Here, U, Q ∈ R^{(n1+n2)×d}, and ∥ denotes the concatenation operation.

3.3.2 Graph-based Fusion. Although traditional attention mechanisms proficiently identify correlations among entities, they often fail to incorporate contextual information concurrently. To address this limitation, we utilize graph neural networks [4], which incorporate contextual information into the learning process by representing it as graph structures. Leveraging this capability, we introduce a graph-based fusion method to enhance the accuracy of information understanding across different modalities.

Initially, we construct a specialized cross-modal directed graph for each path. This graph treats all tokens, including the [cls] and [sep] tokens from both modalities, as entities. These entities are connected via three types of relationships: intra-modal context, cross-modal correspondence, and cross-modal context. The intra-modal context focuses on interactions within a single modality, facilitating a deep understanding of its specific information. Cross-modal correspondence aids in comprehending and learning the spatial correspondence between different modalities. Cross-modal context addresses indirect relationships between different modal entities, which enhances
the model's ability to interpret complex scenes. Collectively, leveraging these relationships significantly boosts the model's capacity to handle multi-modal data effectively.

Figure 4 demonstrates the construction of the graph. Taking node v_4 as an example, node v_4 is connected by directed edges from five entities: the adjacent context nodes v_3 and v_5, its corresponding patch L(v_4), and their respective patches L(v_3) and L(v_5). The patch L(v_4) (i.e., m_i^(j)) is connected by directed edges from nine entities, including v_3, v_4, v_5, L(v_3), L(v_5), and the four geographically adjacent image patches m_i^(j−1), m_i^(j+1), m_i^(j−r/o), and m_i^(j+r/o), where r/o denotes the number of patches per row of an image. Additionally, the [sep] tokens are connected by their context [sep] tokens, the corresponding [sep] tokens from the other modality, and cross-modal context tokens. The [cls] token, encapsulating more global information, is connected by all [sep] tokens and the corresponding [cls] token from the other modality.

Then, we construct an adjacency matrix A ∈ R^{(n1+n2)×(n1+n2)} for each path p to capture the comprehensive relations within the multi-modal data. Given the effectiveness of GCNs in transferring and fusing information across entities within a graph structure, we employ a GCN to derive the updated embeddings for both branches. The updated embeddings are computed as follows:

Û = ReLU( D̃^{−1/2} Ã D̃^{−1/2} ReLU( D̃^{−1/2} Ã D̃^{−1/2} U W_1 ) W_2 ),  (8)
Q̂ = ReLU( D̃^{−1/2} Ã D̃^{−1/2} ReLU( D̃^{−1/2} Ã D̃^{−1/2} Q W_3 ) W_4 ),  (9)

where W_1, W_2, W_3, W_4 ∈ R^{d×d} are weight matrices, and D̃ is the degree matrix of Ã. The augmented adjacency matrix Ã = A + I′, where I′ is a modified identity matrix with all diagonal elements set to 1, except for those corresponding to patches without a relationship to any node (shown as the dark green rectangle with a white border in Figure 3). This modification aims to exclude patches that are relatively unrelated to the path, thereby preventing the introduction of noise into the model.

After iterative graph convolution operations, the embeddings of each entity within the graph are updated. We perform average pooling on Û and Q̂ to aggregate the updated embeddings, respectively. The fused embedding for each branch is then obtained by:

y = AvgPooling(Û),  (10)
z = AvgPooling(Q̂),  (11)

where y, z ∈ R^d denote the image fused embedding and the road fused embedding, respectively.
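Equations (8)-(11) amount to two rounds of symmetrically normalized graph propagation over the cross-modal graph, followed by average pooling. A NumPy sketch (illustrative only; it assumes the cross-modal adjacency matrix A and the diagonal mask I′ have already been built from the three relationship types, and it keeps the full identity on the diagonal for simplicity):

```python
import numpy as np

def normalized_propagation(a_tilde: np.ndarray, x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # One step ReLU(D^{-1/2} A_tilde D^{-1/2} X W), cf. Eqs. (8)-(9).
    deg = a_tilde.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return np.maximum(d_inv_sqrt @ a_tilde @ d_inv_sqrt @ x @ w, 0.0)

def fuse(a: np.ndarray, i_prime: np.ndarray, residual: np.ndarray,
         w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    # residual: the (n1+n2) x d residual embeddings U (or Q); returns the fused
    # d-dimensional embedding y (or z) via two propagation rounds + AvgPooling.
    a_tilde = a + i_prime                      # augmented adjacency matrix
    hidden = normalized_propagation(a_tilde, residual, w1)
    updated = normalized_propagation(a_tilde, hidden, w2)
    return updated.mean(axis=0)                # Eqs. (10)-(11)

# Toy usage: 6 entities (n1 + n2 = 6), d = 4.
rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float)
I_prime = np.eye(6)                            # in the paper, some diagonal entries are zeroed
U = rng.standard_normal((6, 4))
y = fuse(A, I_prime, U, rng.standard_normal((4, 4)), rng.standard_normal((4, 4)))
print(y.shape)  # (4,)
```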
where β is a hyperparameter that controls the margin of the distance between pairs of positive and negative samples, and [·]+ is shorthand for max(0, ·).

3.4 Training objective
The final training objective of our model integrates all previously proposed loss functions, formulated as follows:

L = λ_mask L_mask + λ_multi L_multi + λ_fuse L_fuse,  (13)

where λ_mask, λ_multi, and λ_fuse are the weights assigned to L_mask, L_multi, and L_fuse.

After pre-training, we combine the image fused embedding y with the road fused embedding z into a generic path embedding x = y∥z, achieving a more robust and generalized representation. The generic path embedding is then fine-tuned using interchangeable linear layer task heads, enabling the model to adapt to a variety of downstream tasks effectively.

1 https://www.openstreetmap.org
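A short sketch of how these pieces combine at pre-training and fine-tuning time (illustrative only; the loss values are placeholders, while the λ weights and task-head layer sizes follow Section 4.1.2):

```python
import torch
import torch.nn as nn

# Pre-training: weighted sum of the three objectives, as in Eq. (13).
lambda_mask, lambda_multi, lambda_fuse = 1.0, 1.0, 1.0
l_mask, l_multi, l_fuse = torch.rand(3)          # placeholders for the real losses
total_loss = lambda_mask * l_mask + lambda_multi * l_multi + lambda_fuse * l_fuse

# Fine-tuning: concatenate the fused embeddings into a generic path embedding
# x = y || z and feed it to an interchangeable linear task head (32 -> 1 here).
d = 64
y, z = torch.randn(d), torch.randn(d)            # image / road fused embeddings
x = torch.cat([y, z], dim=-1)                    # generic path embedding (2d-dim)
task_head = nn.Sequential(nn.Linear(2 * d, 32), nn.ReLU(), nn.Linear(32, 1))
prediction = task_head(x)                        # e.g., estimated travel time
print(total_loss.item(), prediction.shape)
```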
4 Experiments
4.1 Experimental Setups
4.1.1 Datasets. We utilize the road networks, GPS datasets, and remote sensing image datasets of two cities: Aalborg, Denmark, and Xi'an, China. The road networks are sourced from OpenStreetMap1, while the remote sensing image datasets are acquired from Google Earth Engine [15]. Employing an existing tool [31], we map-match all GPS records to road networks to generate the path datasets and historical trajectory datasets. The details of the datasets are shown in Table 1.

Table 1: Data statistics

                                     Aalborg       Xi'an
Number of nodes                        7,561       7,051
Number of edges                        9,605       9,642
AVG edge length (m)                   124.78       86.49
Number of paths                       47,865     200,000
AVG node number per road path          25.77       55.39
AVG path length (m)                 3,252.75    4,743.70
Number of traj.                      149,246     797,882
Max travel time of traj. (s)           3,549       8,638
Avg travel time of traj. (s)             199         662
Number of images                         950         133
AVG number of nodes per image           7.96       53.01
AVG image number per image path         6.28        6.87

4.1.2 Implementation Details. All experiments are conducted using PyTorch [33] on Python 3.8 and executed on an NVIDIA Tesla-A800 GPU. Each fixed-size image is 500 × 500 pixels, with each pixel corresponding to 2 meters on the earth. In other words, an image covers a 1 km × 1 km region. We segment each image into 16 patches and set the embedding dimension d to 64. Both the Image-Transformer and the Road-Transformer comprise five layers. To enhance the pre-training efficiency, we initialize our Road-Transformer with the pre-trained LightPath [45]. The mask ratio is set at 15%. The weights λ_mask, λ_fuse, and λ_multi are uniformly set to 1. Training proceeds for up to 60 epochs with a learning rate of 0.02. The linear layer task head includes two fully connected layers, with dimensions of 32 and 1, respectively. Training MM-Path on the Aalborg and Xi'an datasets takes 78 and 161 minutes, respectively. Since training is conducted offline, the runtime is acceptable.

4.1.3 Downstream Tasks and Metrics. Path Travel Time Estimation: We calculate the average travel time (in seconds) for each path based on historical trajectories. The accuracy of travel time estimations is evaluated using three metrics: Mean Absolute Error (MAE), Mean Absolute Relative Error (MARE), and Mean Absolute Percentage Error (MAPE). Path Ranking Score Estimation (Path Ranking): Each path is assigned a ranking score ranging from 0 to 1, derived from historical trajectories by following existing studies [44-46]. We evaluate the effectiveness of path ranking using MAE, the Kendall rank correlation coefficient (τ), and Spearman's rank correlation coefficient (ρ).
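For reference, these metrics can be computed with NumPy and SciPy as sketched below (illustrative only; the exact normalization of MARE is not spelled out in the text, and the reading used here, total absolute error divided by total true value, is an assumption):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = np.abs(y_true - y_pred)
    return {
        "MAE": err.mean(),
        # Assumed reading of MARE: total absolute error relative to total truth.
        "MARE": err.sum() / np.abs(y_true).sum(),
        "MAPE": 100.0 * (err / np.abs(y_true)).mean(),
    }

def ranking_metrics(score_true: np.ndarray, score_pred: np.ndarray) -> dict:
    tau, _ = kendalltau(score_true, score_pred)
    rho, _ = spearmanr(score_true, score_pred)
    return {"MAE": np.abs(score_true - score_pred).mean(), "tau": tau, "rho": rho}

t_true = np.array([210.0, 190.0, 300.0, 150.0, 420.0])   # travel times in seconds
t_pred = np.array([200.0, 205.0, 280.0, 160.0, 400.0])
print(regression_metrics(t_true, t_pred))
print(ranking_metrics(np.array([0.9, 0.1, 0.5, 0.7, 0.3]),
                      np.array([0.8, 0.2, 0.4, 0.6, 0.35])))
```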
4.1.4 Baselines. We compare the proposed model with 5 unsupervised single-modal path pre-trained methods and 5 unsupervised multi-modal methods. The single-modal path pre-trained methods are:
• Node2vec [16]: This is an unsupervised model that learns node representations based on a graph.
• PIM [3]: This is an unsupervised path representation learning approach based on mutual information maximization.
• LightPath [45]: This is a lightweight and scalable path representation learning method.
• TrajCL [5]: This is a contrastive learning-based trajectory modeling method.
• START [21]: This is a self-supervised trajectory representation learning framework with temporal regularities and travel semantics.

The multi-modal methods are:
• CLIP [37]: This is a classic pre-trained multi-modal model. For each path, we use a single rectangular image for the image modality and replace the original text sequence with a node sequence. After pre-training, we concatenate the representations of the two modalities and use them as input to the linear layer task head.
• USPM [8]: This method utilizes both images and the road network to profile individual streets, but not paths (i.e., sequences of streets). We adapt USPM to support path representation learning.
• JGRM [29]: This is a representation learning framework that combines GPS data and road network-based data. In this study, we replace the GPS data with the image path.
• LightPath+image: This is a multi-modal variant of LightPath. We concatenate the patch embedding with the node embedding to replace the original node embedding for training.
• START+image: This is a multi-modal variant of START, processed similarly to LightPath+image.

For all methods, we standardize the embedding dimensionality (d) to 50. All parameters are set according to the specifications in the original papers. All baselines are fine-tuned using a linear layer task head. The output of this task head serves as the prediction result.

For all methods, we initially pre-train using unlabeled training data (e.g., the 30K unlabeled Aalborg dataset and the 160K unlabeled Xi'an dataset). Subsequently, we use a smaller volume of labeled data (e.g., the 10K labeled Aalborg dataset and the 40K labeled Xi'an dataset) for task-specific fine-tuning. Validation and evaluation are conducted on separate validation datasets (e.g., 5K for Aalborg and 20K for Xi'an) and test datasets (e.g., 10K for Aalborg and 40K for Xi'an), respectively.

4.2 Experimental Results
4.2.1 Overall Performance. Table 2 presents the overall performance on both tasks. We use '↑' (and '↓') to indicate that larger (and smaller) values are better. For each task, we highlight the best and second-best performance in bold and underline. "Improvement" and "Improvement*" quantify the enhancements achieved by MM-Path over the best single-modal and multi-modal baselines, respectively.
Table 2: Overall performance

                  |            Aalborg                            |              Xi'an
                  | Travel Time Estimation |    Path Ranking      | Travel Time Estimation  |    Path Ranking
Methods           | MAE↓    MARE↓   MAPE↓  | MAE↓   τ↑      ρ↑    | MAE↓     MARE↓   MAPE↓  | MAE↓   τ↑      ρ↑
Node2vec [16]     | 76.228  0.281   54.182 | 0.203  0.119   0.140 | 227.129  0.269   30.919 | 0.218  0.079   0.098
PIM [3]           | 63.812  0.237   47.054 | 0.144  0.284   0.343 | 207.266  0.246   27.716 | 0.207  0.091   0.102
LightPath [45]    | 58.818  0.221   40.219 | 0.124  0.413   0.483 | 201.400  0.229   26.429 | 0.178  0.209   0.252
TrajCL [5]        | 53.822  0.208   34.239 | 0.113  0.499   0.577 | 202.757  0.238   26.506 | 0.181  0.211   0.256
START [21]        | 51.176  0.191   34.315 | 0.117  0.475   0.556 | 199.843  0.215   25.022 | 0.179  0.229   0.279
CLIP [37]         | 72.155  0.261   50.284 | 0.162  0.179   0.185 | 219.048  0.256   30.962 | 0.213  0.087   0.099
USPM [8]          | 66.714  0.249   51.916 | 0.148  0.308   0.383 | 205.594  0.244   26.039 | 0.209  0.105   0.110
JGRM [29]         | 51.251  0.193   32.380 | 0.115  0.512   0.592 | 201.010  0.228   26.400 | 0.177  0.228   0.262
LightPath+image   | 59.698  0.224   40.920 | 0.131  0.383   0.405 | 205.556  0.242   27.058 | 0.182  0.188   0.231
START+image       | 51.859  0.188   33.401 | 0.122  0.437   0.521 | 200.059  0.211   26.046 | 0.184  0.183   0.226
MM-Path           | 47.756  0.172   29.808 | 0.106  0.558   0.643 | 187.452  0.193   23.644 | 0.165  0.257   0.294
Improvement       | 6.682%  9.947%  12.941%| 6.194% 11.823% 11.443%| 6.201%  10.236%  5.507%| 7.303% 12.227%  5.376%
Improvement*      | 6.819%  8.511%  7.943% | 7.826% 8.984%   8.614%| 6.312%   8.531%  9.222%| 6.780% 12.719% 12.213%
Overall, MM-Path outperforms all baselines on these tasks across both datasets, demonstrating its superiority. Specifically, we can make the following observations. The graph representation learning method Node2vec significantly underperforms compared to MM-Path, primarily due to its focus solely on the topological information of nodes while overlooking the sequential information of paths. Single-modal models like PIM, LightPath, and TrajCL show improved performance over Node2vec, indicating the importance of capturing sequential correlations within paths. Among the single-modal models, START achieves the best performance. It adeptly integrates sequential path information with spatio-temporal transition relationships derived from historical trajectory data. However, as a single-modal model, its capabilities are inherently constrained. As a multi-modal model, CLIP exhibits the weakest performance. Designed primarily for general corpora, it focuses on single, coarse-grained image representations, which often introduce noise into path modeling. Consequently, CLIP struggles to effectively capture complex spatial information and correspondences, making it unsuitable for modeling paths. USPM performs poorly because it analyzes individual streets using images and road networks, rather than paths (i.e., street sequences). As a result, it fails to effectively mine the sequential relationships present in the two modalities. The variants LightPath+image and START+image perform comparably to their single-modal counterparts (i.e., LightPath and START), suggesting that merely concatenating two modalities does not effectively enhance multi-modal fusion. Having adapted JGRM to integrate image paths and road paths, JGRM outperforms the other multi-modal baselines. It is specifically designed for multi-modal integration and excels at merging information from various sources. However, JGRM's limitations in handling multi-modal information of varying granularities and its lack of use of cross-modal context information to guide the fusion process make its performance less optimal compared to MM-Path.

4.2.2 Ablation Study. We design eight variants of MM-Path to verify the necessity of the components of our model: (1) MM-Path-z: This variant leverages the road fused embedding z as a generic representation of the path. (2) MM-Path-y: This model utilizes the image fused embedding y as a generic representation of the path. (3) w/o alignment: This version excludes the multi-granularity loss. (4) w/o fusion: This variant substitutes the graph-based residual fusion component with average pooling of the encoded embeddings from both modalities. (5) w/o GCN: This model replaces the GCN in the graph-based cross-modal residual fusion component with a cross-attention mechanism. (6) w/o fine, (7) w/o medium, and (8) w/o coarse: These variants omit the fine-grained, medium-grained, and coarse-grained loss, respectively.

The results are summarized in Tables 3 and 4. We can observe that MM-Path w/o alignment shows poor performance, which is attributed to its reliance solely on multi-modal data fusion without considering multi-granularity alignment. The variants w/o fine, w/o medium, and w/o coarse outperform w/o alignment but remain worse than the full MM-Path, demonstrating the importance of multiple granularity alignments. MM-Path w/o fusion also exhibits poor performance, while MM-Path w/o GCN performs slightly worse than MM-Path. These results indicate that complex fusion methods with cross-modal context information enhance path understanding. Both MM-Path-y and MM-Path-z demonstrate comparable performance in travel time estimation and path ranking. This indicates that different modal perspectives contribute valuable insights for various downstream tasks. The overall performance of MM-Path surpasses all variants. This result implies that each of the proposed components significantly enhances the model's effectiveness. This conclusively validates that MM-Path optimally utilizes all designed components.

Table 3: Effect of variants of MM-Path in Aalborg

                  Travel Time Estimation      Path Ranking
Methods           MAE      MARE     MAPE      MAE     τ       ρ
MM-Path-z         49.649   0.185    30.193    0.114   0.528   0.622
MM-Path-y         48.529   0.181    32.722    0.118   0.511   0.603
w/o alignment     52.832   0.201    36.251    0.131   0.300   0.379
w/o fusion        51.237   0.192    30.529    0.115   0.476   0.560
w/o GCN           48.651   0.183    33.371    0.111   0.532   0.619
w/o fine          51.641   0.192    33.277    0.129   0.441   0.523
w/o medium        50.932   0.187    34.250    0.114   0.494   0.583
w/o coarse        50.688   0.189    35.341    0.117   0.505   0.596
MM-Path           47.756   0.172    29.808    0.106   0.558   0.643

Table 4: Effect of variants of MM-Path in Xi'an

                  Travel Time Estimation      Path Ranking
Methods           MAE      MARE     MAPE      MAE     τ       ρ
MM-Path-z         194.301  0.231    24.455    0.183   0.199   0.241
MM-Path-y         196.331  0.233    24.747    0.196   0.178   0.214
w/o alignment     200.335  0.239    26.459    0.195   0.131   0.167
w/o fusion        200.652  0.239    25.433    0.208   0.113   0.130
w/o GCN           189.659  0.221    24.496    0.173   0.234   0.286
w/o fine          199.214  0.235    26.913    0.177   0.226   0.275
w/o medium        192.514  0.229    24.826    0.175   0.227   0.278
w/o coarse        194.256  0.230    25.757    0.176   0.231   0.278
MM-Path           187.452  0.193    23.644    0.165   0.257   0.294

4.2.3 Effect of Pre-training. In this section, we evaluate the effect of pre-training. We vary the size of labeled data used for fine-tuning, and compare the performance of the proposed MM-Path (Pre-trained) with its variant that lacks pre-training (No Pre-trained). Figure 5 shows the performance of travel time estimation and path
[Figure 5: Effect of pre-training - MAE of MM-Path with and without pre-training (Pre-trained vs. No Pre-trained) as the amount of labeled training data varies.]