
CMCLRec: Cross-modal Contrastive Learning for User Cold-start Sequential Recommendation

Xiaolong Xu ([email protected]), Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China
Hongsheng Dong ([email protected]), Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China
Lianyong Qi* ([email protected]), China University of Petroleum (East China), Qingdao, Shandong, China
Xuyun Zhang* ([email protected]), Macquarie University, Sydney, New South Wales, Australia
Haolong Xiang ([email protected]), Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China
Xiaoyu Xia ([email protected]), RMIT University, Melbourne, Victoria, Australia
Yanwei Xu ([email protected]), Tianjin University, Tianjin, China
Wanchun Dou ([email protected]), Nanjing University, Nanjing, Jiangsu, China

*Corresponding author.

ABSTRACT

Sequential recommendation models generate embeddings for items through the analysis of historical user-item interactions and utilize the acquired embeddings to predict user preferences. Despite being effective in revealing personalized preferences for users, these models heavily rely on user-item interactions. However, due to the lack of interaction information, new users face challenges when utilizing sequential recommendation models for predictions, which is recognized as the cold-start problem. Recent studies, while addressing this problem within specific structures, often neglect compatibility with existing sequential recommendation models, making seamless integration into those models unfeasible. To address this challenge, we propose CMCLRec, a Cross-Modal Contrastive Learning framework for user cold-start RECommendation. This approach aims to solve the user cold-start problem by customizing inputs for cold-start users that align with the requirements of sequential recommendation models in a cross-modal manner. Specifically, CMCLRec adopts cross-modal contrastive learning to construct a mapping from user features to user-item interactions based on warm user data. It then generates a simulated behavior sequence for each cold-start user in turn for recommendation purposes. In this way, CMCLRec is theoretically compatible with any extant sequential recommendation model. Comprehensive experiments conducted on real-world datasets substantiate that, compared with state-of-the-art baseline models, CMCLRec markedly enhances the performance of conventional sequential recommendation models, particularly for cold-start users.

CCS CONCEPTS

• Information systems → Recommender systems; • Computing methodologies → Neural networks.

KEYWORDS

Sequential Recommendation, Cold-start, Cross-modal Contrastive Learning, Self-supervised Learning

ACM Reference Format:
Xiaolong Xu, Hongsheng Dong, Lianyong Qi, Xuyun Zhang, Haolong Xiang, Xiaoyu Xia, Yanwei Xu, and Wanchun Dou. 2024. CMCLRec: Cross-modal Contrastive Learning for User Cold-start Sequential Recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24), July 14-18, 2024, Washington, DC, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3626772.3657839

1 INTRODUCTION

Recommender systems have been extensively applied across diverse online and mobile platforms, including but not limited to e-commerce, music streaming, and social media platforms [37]. In such platforms, user behavior evolves over time [2], and the number of items that typical users interact with usually represents only 1%-2% of the total items. This results in a highly sparse user-item interaction matrix, posing significant limitations on traditional recommendation algorithms, such as collaborative filtering [19] and two-tower models [36].

Recently, the rapid development of deep learning [4] has engendered substantial investigation into embedding-based sequential recommender systems [6].


These systems have been widely adopted in the industry, benefiting from their proficiency in accurately capturing the dynamic behaviors of users and providing high-quality recommendations. Sequential recommender systems typically capture sequential information from user-item interaction sequences to forecast the next potential item with which the user is predisposed to interact. However, despite the diversity of models, sequential recommender systems exhibit a strong dependency on user-item interaction sequences due to fixed recommendation patterns. This leads to suboptimal performance when recommending to new users, a challenge commonly known as the user cold-start problem.

To tackle this problem, various methods have been investigated, such as content-based cold-start recommendation [30] and Dropout [24]. Content-based cold-start recommendation recommends items by analyzing item content information, as well as user personal characteristics and preferences. In this way, it can better understand user interests, especially in scenarios involving new users or a lack of historical user behavioral data. Dropout is another popular mechanism. During training, it randomly drops some neurons and user feature information to reduce the model's reliance on historical interaction data, which helps enhance the model's generalization ability and effectively improves its cold-start performance. This approach encourages cold-start recommendations based on alternative content information, mitigating the impact of suboptimal ID embeddings. Additionally, there are models based on meta-learning [25], active learning [39], and other methods to address the cold-start problem.

However, the above-mentioned methods face a common challenge in that they struggle to be integrated into existing sequential recommendation models, since specialized structural requirements are needed to address specific data distributions [33]. This problem stems from substantial differences in the model structures, feature representations, and parameter settings of various cold-start algorithms compared to established sequential recommendation models. Besides, several approaches concentrate extensively on enhancing cold-start performance, frequently overlooking recommendations for regular users. Although state-of-the-art sequential recommendation models, such as SASRec [11] and MCLRec [16], exhibit suboptimal performance in recommending to cold-start users, they perform well with warm users. If a cold-start module could be seamlessly integrated, the performance is expected to show a significant improvement. To be compatible with existing sequential recommendation models, recommending to cold-start users requires generating simulated behavior sequences as model inputs. However, due to the substantial distribution gap between user features and behavior sequences, an effective alignment technique is urgently needed to tackle the mapping issue.

Recently, cross-modal contrastive learning [38], capable of generating mappings between multiple modalities, has garnered widespread attention in various fields, such as computer vision [42] and natural language processing [35]. This approach reinforces semantic relationships by learning shared embedding representations of diverse modal data and facilitating closer proximity of similar content in this space. This makes it possible to generate simulated behavioral sequences.

Furthermore, the embedded modules should not impose additional training burdens on the overall model and should not demand an excessive amount of extra training data. For instance, Chen et al. [3] address the cold-start problem using generative adversarial networks to reduce the difference between cold and warm item embeddings. Lee et al. [12] employ meta-learning to estimate cold-start user preference. However, such methods are constrained by the need for additional training data, increasing the costs associated with data collection and processing accordingly. Interestingly, in [34], self-supervised learning is introduced into the recommender system. It directly masks Wikipedia data for training and utilizes contrastive learning techniques, without the need for extra preparation.

Considering the preceding discourse, in this paper we introduce a Cross-Modal Contrastive Learning framework for sequential RECommendation (CMCLRec) to address the user cold-start problem. The fundamental concept underlying CMCLRec is to utilize the cross-modal contrastive learning method to establish a mapping between user features and user-item interactions. This mapping is employed to generate simulated behavior sequences for cold-start users, and advanced sequential recommendation models subsequently utilize the generated sequences to provide recommendations for cold-start users. The model is composed of three main modules: the data augmentation module, the cross-modal contrastive learning module, and the sequential recommendation module. The data augmentation module employs contrastive learning to enhance user features and user-item interaction sequences, encouraging the model to bring embeddings of similar users closer in the embedding space to extract more enriched hidden features. The cross-modal contrastive learning module utilizes auto-encoder techniques to map user features and user-item interaction sequences into the same embedding space. It learns the mapping from user features to user-item interaction sequences based on warm user data, enabling the construction of simulated behavior sequences for cold-start users. The sequential recommendation module is agnostic to the underlying model and can be instantiated using any embedding-based sequential recommendation model. Our framework preserves the advanced overall recommendation performance by leveraging the advantages of cutting-edge sequential recommendation models while augmenting their effectiveness in cold-start scenarios. In addition, the initial two modules adopt self-supervised learning, eliminating the need for additional labeled data preparation.

The main contributions of our work are as follows:

• We design a novel framework, named Cross-Modal Contrastive Learning Recommendation (CMCLRec), to mitigate the user cold-start problem in recommender systems. CMCLRec generates simulated behavior sequences based on user features for cold-start users, facilitating their incorporation into any sequential recommendation model to enhance the performance in cold-start scenarios.
• We employ self-supervised training in CMCLRec to enhance the recommendation performance of the model for cold-start users without requiring supplementary label data.
• We conduct experiments based on two publicly available datasets. The results illustrate that CMCLRec outperforms the most competitive model across all scenarios, and the ablation study further confirms the effectiveness of the cross-modal construction of simulated behavior sequences.


2 RELATED WORK

2.1 Sequential Recommendation

Sequential recommendation, as investigated in previous studies [6, 28], utilizes user-item interactions to formulate embeddings for users. It forecasts the subsequent item that is most likely to be interacted with by the user. This paradigm has undergone extensive scrutiny and practical application in both academic and industrial contexts. Traditional sequential recommendation adopts Markov chain models [8, 17], which have significant advantages in modeling user-item interactions in a sequence. However, Markov properties can only capture short-term and point dependencies, making them less suitable for real-world scenarios. Amidst the rapid evolution of deep learning, neural networks have progressively been integrated into recommender systems to address the limitations of traditional algorithms. GRU4Rec [9] employs Recurrent Neural Networks (RNN) to predict the next possible interaction by capturing the sequential relationships within a given user-item interaction sequence, introducing positional information of items into the model. Caser [20], inspired by the computer vision field, utilizes Convolutional Neural Networks (CNN) with a focus on short-term behavioral preferences that have a more significant impact on users. SR-GNN [32], rooted in Graph Neural Networks, conceptualizes interactions as nodes within a graph, subsequently mapping each sequence to paths in the graph. Ultimately, it acquires embeddings for users or items within the graph. The advent of the Transformer architecture has elevated the self-attention mechanism to a mainstream approach in recommender systems. SASRec [11] introduces the Transformer into sequential recommendation, employing the self-attention mechanism to model user-item interaction sequences and extract more valuable features.

2.2 Cold-start Recommendation

Despite the substantial success achieved by embedding-based recommendation models in the realm of recommender systems, they encounter challenges in delivering accurate recommendations for cold-start users devoid of user-item interaction sequences. This limitation contributes to a notable decline in user retention rates. Efficiently utilizing side information such as attribute features, knowledge graphs, and auxiliary domains becomes a common solution for cold-start scenarios without user-item interaction sequences. DropoutNet [24] introduces a dropout mechanism during training, significantly reducing the model's dependence on ID embeddings and enhancing the weights of other content features. This approach allows cold-start users to be recommended mainly based on other content features, mitigating the impact of poor ID embeddings. MetaEmbedding [14] leverages item features, excluding ID, and incorporates a generator network to produce the initialization values for ID embeddings. In the case of cold-start items, the generator forecasts their initial ID embeddings, and subsequent training and recommendations are executed based on these embeddings. MWUF [40] generates scaling and shifting functions from item features using meta-learning, which are employed to transform features for cold-start items, mapping them to another feature space to enhance prediction accuracy. In the context of the cold-start scenario with limited user-item interaction sequences, efficiently utilizing the scarce data is a crucial challenge. Vartak et al. [23] addressed cold-start items on Twitter by training a classifier on items that users have interacted with and then using this classifier to determine whether a user is interested in a cold-start item. MetaTL [27] adopts a sequential recommendation model, utilizing few-shot learning to recommend to cold-start users and leveraging meta-learning to enhance the accuracy of recommendations. MML [15] integrates side information of items into the meta-learning process to improve the recommendation effect for cold-start items. MeLU (Meta-Learned User Preference) [12] employs the Model-Agnostic Meta-Learning (MAML) algorithm for the purpose of meta-learning a shared set of initialization parameters. For each cold-start user, MeLU fine-tunes the initialized model using the limited user-item interaction data to obtain a user-customized model for recommending items to cold-start users.

Recently, contrastive learning [29] has been widely applied, achieving unprecedented success and providing an alternative approach to addressing the cold-start problem. CLCRec [31] maximizes the dependence between item content and collaborative signals based on a contrastive learning objective function, enabling the model to retain interactive information in the content representation of cold-start items. CPKSPA [13] introduces an effective combination of a rating prediction module, an embedding distribution alignment module, and a contrastive augmentation module to reduce differences between latent embedding distributions across domains. This results in more stable and robust embeddings for cold-start items. Socially-aware dual contrastive learning [5] introduces an approach that integrates user-user relationships, user-item interactions, and item-item similarity to adapt representations within a semi-supervised environment. Cold-start users leverage their social relationships for modeling warm users without necessitating additional user-item interaction records.

However, the aforementioned studies face challenges in seamless integration with existing efficient recommendation models, thereby significantly compromising their flexibility. Moreover, these studies have not fully harnessed the implicit relationships between user-item interactions and user features.

3 METHOD

3.1 Overview

In this section, we introduce the CMCLRec framework, which enhances conventional sequential recommendation algorithms to achieve improved accuracy in recommending items for cold-start users. CMCLRec consists of three modules: the data augmentation module, the cross-modal contrastive learning module, and the sequential recommendation module.

3.2 Problem Formulation

This study aims to generate simulated behavioral sequences for cold-start users, for whom interaction sequences are unavailable, relying on their feature information. This facilitates the seamless integration of CMCLRec into pre-existing sequential recommendation models without compromising the original model's effectiveness for warm users.


[Figure 1: The overall architecture of CMCLRec. The diagram shows three panels: the data augmentation module, the cross-modal contrastive learning module, and the sequential recommendation (SeqRec) module.]

Furthermore, the incorporation of self-supervised learning and contrastive learning into the framework is executed without a concomitant escalation in overall training complexity or the necessity for additional labeled data.

Let $U$ denote the user set, $I$ denote the item set, and $u \in U$ denote a user. The historical interaction sequence for user $u$ is denoted as $S^u = \{i_1^u, i_2^u, \dots, i_n^u\}$, where $n$ is the length of the interaction sequence, and $i_t^u \in I$ ($1 \le t \le n$) represents the item at position $t$ interacted with by user $u$. Additionally, user $u$ is associated with feature information $F^u$. The goal of our framework is to use the feature information $F^u$ of a cold-start user to derive an embedding for the interaction sequences. This derived embedding is then utilized to generate recommendations for the cold-start user.

Given that $F^u$ may not be sufficient, utilizing these features to infer behavior sequences proves challenging. Therefore, a contrastive learning approach is initially employed to enhance $F^u$ for $u \in U$, aiming to acquire more comprehensive user features. Using cross-modal contrastive learning, we construct a cross-modal mapping from $F^u$ to $S^u$ and generate simulated behavior sequences for cold-start users. For warm users, direct recommendations are made using the enhanced $S^u$; for cold-start users, the generated simulated behavior sequences are used to predict their preferences.

3.3 Data Augmentation Module

Owing to the incompleteness of registration information for a considerable number of users and the inadequate richness of feature content, the data augmentation module employs a contrastive learning methodology to augment the features of the data, as illustrated in the data augmentation module of Figure 1. It incentivizes the model to constrict the proximity of embeddings corresponding to similar users within the embedding space, while concurrently amplifying the separation between embeddings associated with dissimilar users.

Given a batch size of $N$, for each user $u_k$ ($1 \le k \le N$), the feature $F^{u_k}$ is subjected to different data augmentation methods, resulting in partially masked features $F_1^{u_k}$ and $F_2^{u_k}$. It is ensured that the features masked in $F_1^{u_k}$ are inconsistent with those masked in $F_2^{u_k}$, where the set of data augmentation functions is denoted as $\Phi$. The augmented feature representation is then presented by:

$$F_1^{u_k} = \varphi_{f1}(F^{u_k}), \quad F_2^{u_k} = \varphi_{f2}(F^{u_k}), \quad \text{s.t. } \varphi_{f1}, \varphi_{f2} \in \Phi, \tag{1}$$

where $\varphi_{f1}$ and $\varphi_{f2}$ are distinct data augmentation functions, and $F_1^{u_k}$ and $F_2^{u_k}$ represent the different feature embeddings generated from $F^{u_k}$ through these two functions, respectively.

Based on the enhanced $F_1^{u_k}$ and $F_2^{u_k}$, a shared encoder, denoted by $\Psi_f(\cdot)$, is applied to both views, ensuring parameter consistency throughout the training process. The encoded embeddings $h_{f1}^{u_k}$ and $h_{f2}^{u_k}$ can be expressed as:

$$h_{f1}^{u_k} = \Psi_f(F_1^{u_k}), \quad h_{f2}^{u_k} = \Psi_f(F_2^{u_k}), \tag{2}$$

where $F_1^{u_k}$ and $F_2^{u_k}$ can both be regarded as containing partial information from $F^{u_k}$.
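The paper releases no reference implementation, so the following is a minimal PyTorch sketch of Eqs. (1)-(2) under stated assumptions: user features are dense vectors, the two augmentation functions $\varphi_{f1}$ and $\varphi_{f2}$ zero out disjoint random subsets of feature dimensions (one concrete choice of masking), and $\Psi_f$ is the MLP feature encoder described in Section 3.3. All names are illustrative.

```python
import torch
import torch.nn as nn

def two_view_mask(features: torch.Tensor, mask_ratio: float = 0.3):
    """Eq. (1): build two views of the user features by zeroing out
    disjoint random subsets of feature dimensions (phi_f1, phi_f2)."""
    d = features.size(-1)
    perm = torch.randperm(d)
    k = int(d * mask_ratio)
    m1, m2 = perm[:k], perm[k:2 * k]      # disjoint index sets
    f1, f2 = features.clone(), features.clone()
    f1[..., m1] = 0.0                     # view 1 masks the dims in m1
    f2[..., m2] = 0.0                     # view 2 masks the dims in m2
    return f1, f2

class FeatureEncoder(nn.Module):
    """Psi_f in Eq. (2): one MLP applied to both views (shared parameters)."""
    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 2 * emb_dim), nn.ReLU(),
            nn.Linear(2 * emb_dim, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# usage: h_f1 and h_f2 then feed the contrastive loss of Eq. (3)
features = torch.randn(32, 128)           # a batch of user feature vectors
f1, f2 = two_view_mask(features)
encoder = FeatureEncoder(in_dim=128)
h_f1, h_f2 = encoder(f1), encoder(f2)
```

Because the encoder is shared, agreement between the two views can only come from the unmasked evidence, which is what pushes embeddings of the same user together.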


The purpose of the data augmentation module is to expand user features as much as possible, which can be analogized as restoring $F^{u_k}$ from $F_1^{u_k}$ and $F_2^{u_k}$. Therefore, for their embeddings $h_{f1}^{u_k}$ and $h_{f2}^{u_k}$, it is necessary to minimize the distribution gap between each embedding and its corresponding feature, while simultaneously maximizing the distribution gap between embeddings of different users. In detail, we employ cosine similarity to represent the similarity between embeddings, aiming to maximize the similarity among similar users (as shown in the numerator of Eq. (3)) while minimizing the similarity among dissimilar users (as shown in the denominator of Eq. (3)). The contrastive loss $\mathcal{L}_c^f$ can be calculated as:

$$\mathcal{L}_c^f = -\frac{1}{N}\sum_{k=1}^{N}\log\left(\frac{\exp\left(s(h_{f1}^{u_k}, h_{f2}^{u_k})/\tau\right)}{\sum_{t=1}^{N}\exp\left(s(h_{f1}^{u_k}, h_{f2}^{u_t})/\tau\right)}\right), \tag{3}$$

where the function $s(\cdot)$ represents the similarity between two vectors, serving to quantify the distribution gap between them, and $\tau$ is a pre-defined temperature hyperparameter that governs the model's discriminative capacity concerning negative samples. A too-large $\tau$ value may result in insufficient discrimination between positive and negative samples, thereby contributing to suboptimal model performance. Conversely, a too-small $\tau$ value may cause the model to overly focus on negative samples, making it challenging for the model to converge. It is evident that by minimizing the contrastive loss, the distances between positive samples can be reduced while simultaneously increasing the distances between negative samples.

Subsequently, the same methodology is applied to process user behavior sequences. For a user $u_k$ ($1 \le k \le N$) and its behavior sequence $S^{u_k}$, the final encoded embeddings $h_{s1}^{u_k}$ and $h_{s2}^{u_k}$ are generated. The contrastive loss can be calculated with:

$$\mathcal{L}_c^s = -\frac{1}{N}\sum_{k=1}^{N}\log\left(\frac{\exp\left(s(h_{s1}^{u_k}, h_{s2}^{u_k})/\tau\right)}{\sum_{t=1}^{N}\exp\left(s(h_{s1}^{u_k}, h_{s2}^{u_t})/\tau\right)}\right). \tag{4}$$

Due to the powerful capabilities of contrastive learning and the relatively uncomplicated nature of user feature content, a basic Multi-Layer Perceptron (MLP) is utilized as the encoder for contrasting user features. However, a basic MLP cannot handle more complex user behavior sequences. Here, we utilize a Transformer encoder with enhanced expressive capacity for encoding the sequences. The incorporation of this amalgamation of contrastive learning and self-supervised learning facilitates the extraction of more comprehensive information from both user features and behavior sequences, obviating the necessity for additional data. This streamlines the subsequent implementation of cross-modal fusion.

3.4 Cross-modal Contrastive Learning Module

User features $F^{u_k}$ and behavior sequences $S^{u_k}$ are directly input into the data augmentation module, producing the enhanced user feature embedding $h_f^{u_k}$ and behavior sequence embedding $h_s^{u_k}$.

Conventional deep networks struggle to attend to the entire sequence information effectively, and they may not fully utilize the comprehensive information within $h_f^{u_k}$ and $h_s^{u_k}$ along with their implicit combined information. Thus, a self-attention mechanism is employed for embedding construction. Taking the user behavior sequence as an example, let $h_s^{u_k} = \{a_1, a_2, \dots, a_n\}$, where $a_i$ ($1 \le i \le n$) is a column vector representing the embedding of the corresponding item at position $i$. Introducing transformation matrices $M_1$, $M_2$, and $M_3$, subject to parameter updates through learning, the transformation for $a_i$ is as follows:

$$\sigma_1^i = M_1 a_i, \quad \sigma_2^i = M_2 a_i, \quad \sigma_3^i = M_3 a_i, \tag{5}$$

where $\sigma_1^i$ and $\sigma_2^j$ are used to compute the similarity between $a_i$ and $a_j$, while $\sigma_3^i$ encapsulates the original information of $a_i$.

Subsequently, the similarity between $a_i$ and $a_j$, denoted as $\alpha_{i,j}$, is calculated using the following formula:

$$\alpha_{i,j} = \frac{\exp\left(\sigma_1^i \cdot \sigma_2^j / \sqrt{d}\right)}{\sum_{k=1}^{n}\exp\left(\sigma_1^i \cdot \sigma_2^k / \sqrt{d}\right)}, \tag{6}$$

where $d$ represents the dimensionality of $\sigma_1^i$ and $\sigma_2^j$. Since the inner product tends to increase with the dimensionality, normalization by $\sqrt{d}$ is required to prevent unnecessary errors caused by dimensionality.

Through the self-attention operation, $h_s^{u_k}$ is transformed into $T_s^{u_k} = \{b_1, b_2, \dots, b_n\}$, where $b_i$ ($1 \le i \le n$) is represented as:

$$b_i = \sum_{j=1}^{n} \alpha_{i,j} \cdot \sigma_3^j. \tag{7}$$

Evidently, each $b$ encompasses information from all of the $a$. Furthermore, in this layer, there are only three transformation matrices: $M_1$, $M_2$, and $M_3$. Without significantly increasing the training complexity, a weighted approach is applied to the distinct features of items, allowing for the collection of combined information among the various items within $h_s^{u_k}$. This significantly reduces the subsequent challenges in modal fusion. Similarly, $h_f^{u_k}$ undergoes a self-attention operation and transforms into $T_f^{u_k} = \{b_1, b_2, \dots, b_n\}$.

Based on the $T_s^{u_k}$ and $T_f^{u_k}$ generated by self-attention, a mapping from user features to behavior sequences is constructed using autoencoders and cross-modal learning methods. $E_s$ and $E_f$ denote the encoders for $T_s^{u_k}$ and $T_f^{u_k}$, respectively, while $D_s$ and $D_f$ represent their corresponding decoders. Let $\widehat{T}_s^{u_k} = D_s(E_s(T_s^{u_k}))$ and $\widehat{T}_f^{u_k} = D_f(E_f(T_f^{u_k}))$, ensuring that the encoded-decoded results preserve the information from $T_s^{u_k}$ and $T_f^{u_k}$ as much as possible. The autoencoder loss can be expressed as:

$$\mathcal{L}_s = \sum_{u_k \in U_b} \left\|\widehat{T}_s^{u_k} - T_s^{u_k}\right\|_2^2, \qquad \mathcal{L}_f = \sum_{u_k \in U_b} \left\|\widehat{T}_f^{u_k} - T_f^{u_k}\right\|_2^2. \tag{8}$$
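Eqs. (3)-(4) are the standard in-batch InfoNCE objective, so they can be sketched compactly; this sketch assumes cosine similarity for $s(\cdot,\cdot)$, consistent with the text above, and is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h1: torch.Tensor, h2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Eqs. (3)-(4): in-batch InfoNCE. Row k of h1 and h2 are the two views of
    user u_k; rows t != k of h2 act as negatives. s(.,.) is cosine similarity."""
    h1 = F.normalize(h1, dim=-1)               # cosine similarity via
    h2 = F.normalize(h2, dim=-1)               # normalized dot products
    sim = h1 @ h2.t() / tau                    # sim[k, t] = s(h1_k, h2_t) / tau
    labels = torch.arange(h1.size(0), device=h1.device)
    # cross-entropy over each row = -log(exp(diagonal) / sum(exp(row)))
    return F.cross_entropy(sim, labels)

# the same function serves both L_c^f (feature views) and L_c^s (sequence views)
loss_f = contrastive_loss(torch.randn(32, 64), torch.randn(32, 64))
```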

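Likewise, the self-attention of Eqs. (5)-(7) is ordinary single-head scaled dot-product attention, with $M_1$, $M_2$, and $M_3$ playing the roles of the query, key, and value projections; a hedged sketch follows.

```python
import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """Eqs. (5)-(7): M1, M2, M3 project each item embedding a_i into sigma_1,
    sigma_2, sigma_3; alpha is the softmax over scaled dot products (Eq. (6));
    each output b_i is the alpha-weighted sum of the sigma_3 vectors (Eq. (7))."""
    def __init__(self, dim: int):
        super().__init__()
        self.m1 = nn.Linear(dim, dim, bias=False)  # sigma_1 = M1 a
        self.m2 = nn.Linear(dim, dim, bias=False)  # sigma_2 = M2 a
        self.m3 = nn.Linear(dim, dim, bias=False)  # sigma_3 = M3 a

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # a: (batch, n, dim) item embeddings of one behavior sequence per row
        q, k, v = self.m1(a), self.m2(a), self.m3(a)
        alpha = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(a.size(-1)),
                              dim=-1)              # Eq. (6), sqrt(d) scaling
        return alpha @ v                           # Eq. (7): T = {b_1, ..., b_n}
```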

Let $p_s^{u_k} = E_s(T_s^{u_k})$ and $p_f^{u_k} = E_f(T_f^{u_k})$. This establishes the foundation for the implementation of cross-modal learning methods. In this context, the model is trained to acquire the mapping from $p_f^{u_k}$ to $\widehat{T}_s^{u_k}$ while concurrently minimizing the distribution gap between $p_s^{u_k}$ and $p_f^{u_k}$. This enables both vectors to exhibit the capacity for predicting $\widehat{T}_s^{u_k}$. Despite the significant distribution difference between $p_s^{u_k}$ and $p_f^{u_k}$, there exists a correlation between them, as both are generated by the same user $u_k$. Therefore, adopting a transfer-learning-like approach, coupled with the Maximum Mean Discrepancy (MMD) loss function, facilitates cross-modal learning by minimizing the distribution gap between $p_s^{u_k}$ and $p_f^{u_k}$. The loss function $\mathcal{L}_{cross}$ is calculated as follows:

$$\begin{aligned}
\mathcal{L}_{cross} &= \mathrm{MMD}_{\mathcal{H}}\left(P_s, P_f\right) + \sum_{u_k \in U_b}\left\|\widehat{T}_s^{u_k} - T_f^{u_k}\right\|_2^2 \\
&= \left\|\frac{1}{N}\sum_{i=1}^{N}\phi(p_s^{u_i}) - \frac{1}{N}\sum_{i=1}^{N}\phi(p_f^{u_i})\right\|^2 + \sum_{u_k \in U_b}\left\|\widehat{T}_s^{u_k} - T_f^{u_k}\right\|_2^2 \\
&= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(p_s^{u_i}, p_s^{u_j}) - \frac{2}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(p_s^{u_i}, p_f^{u_j}) \\
&\quad + \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(p_f^{u_i}, p_f^{u_j}) + \sum_{u_k \in U_b}\left\|\widehat{T}_s^{u_k} - T_f^{u_k}\right\|_2^2.
\end{aligned} \tag{9}$$

For a batch, $P_s = \{p_s^{u_1}, p_s^{u_2}, \dots, p_s^{u_N}\}$ and $P_f = \{p_f^{u_1}, p_f^{u_2}, \dots, p_f^{u_N}\}$, where $N$ denotes the batch size. Let $\kappa$ denote the Gaussian kernel function, $\phi: x \to \mathcal{H}$ denote the feature mapping, and $\mathcal{H}$ denote the reproducing kernel Hilbert space corresponding to $\kappa$. The kernel function $\kappa$ can be expressed as:

$$\kappa(x, y) = \exp\left(-\frac{(x - y)^2}{2\sigma^2}\right), \tag{10}$$

where the parameter $\sigma$ regulates the range of the Gaussian kernel function, with an increased value signifying a more extensive local impact range for the Gaussian kernel function.

The overall loss function for this module is as follows:

$$\mathcal{L}_{c2l} = \mathcal{L}_s + \mathcal{L}_f + \lambda\mathcal{L}_{cross}, \tag{11}$$

where $\lambda$ represents the weight of cross-modal learning. An excessively large value can impede model convergence, while an overly small value can result in poorer reconstruction ability, insufficient information in the simulated behavior sequence, and suboptimal recommendation performance.

Following the completion of cross-modal learning, $p_s^{u_k}$ and $p_f^{u_k}$ are similar in terms of distributions and possess predictive capabilities for $\widehat{T}_s^{u_k}$. Therefore, for warm users, direct recommendations can be made using the behavior sequence $T_s^{u_k}$. For cold-start users, $T_f^{u_k}$ is used to calculate $p_f^{u_k}$ through the encoder $E_f$, and the simulated behavior sequence $\widehat{T}_s'^{u_k}$ is obtained through the decoder $D_s$ for user recommendations.

3.5 Sequential Recommendation Module

After the cross-modal learning module, the model has acquired a mapping from user features to behavior sequences. Therefore, the next step involves making recommendations to users. The method we propose does not impose specific requirements on the implementation of the sequential recommendation model. In theory, any existing sequential recommendation model can be employed, enhancing its recommendation performance in cold-start scenarios. The sequential recommendation model is represented by the function $\mathrm{R}(\cdot)$.

As the recommendation approaches for warm users and cold-start users differ in the model, $flag \in \{0, 1\}$ is introduced to distinguish between them:

$$flag = \begin{cases} 0, & \text{warm user} \\ 1, & \text{cold-start user} \end{cases}. \tag{12}$$

The input to the recommendation network can be differentiated based on the $flag$. For a batch of size $N$, where the users within the batch are denoted as $U_b = \{u_1, u_2, \dots, u_N\}$, the behavior sequence of a warm user $u_k$ can be represented as $T_s^{u_k} = \{i_1^{u_k}, i_2^{u_k}, \dots, i_n^{u_k}\}$, and the simulated behavior sequence for a cold-start user $u_j$ can be represented as $\widehat{T}_s'^{u_j} = \{i_1^{u_j}, i_2^{u_j}, \dots, i_n^{u_j}\}$. The input $X$ to the network can be expressed as:

$$X^{u_k} = flag \cdot \widehat{T}_s'^{u_k} + (1 - flag) \cdot T_s^{u_k}. \tag{13}$$

Following the recommendation model, its output is denoted as $G^{u_k} = \mathrm{R}(X^{u_k})$, where $G^{u_k} = \{g_1^{u_k}, g_2^{u_k}, \dots, g_n^{u_k}\}$. The loss function can be expressed as:

$$\mathcal{L}_{rec} = -\frac{1}{N}\sum_{u_k \in U_b}\left\{\mathrm{S}(g_n^{u_k}) - \ln\left(\sum_{i \in I}\exp\left(\mathrm{S}(i_e)\right)\right)\right\}, \tag{14}$$

where $i_e$ denotes the embedding of item $i$, and $\mathrm{S}(\cdot)$ signifies the softmax function. The optimization of this cross-entropy loss function is directed towards maximizing the probability of accurate predictions.

3.6 Training Strategy

Within the data augmentation and cross-modal contrastive learning modules, CMCLRec aims to explore latent user features and extract the mapping relationship between user features and user-item interactions. These modules operate independently of the sequential recommendation module. Additionally, the first two modules employ self-supervised training, obviating the need for labeled data and rendering them apt for independent large-scale training. Consequently, the training of the comprehensive framework is conducted in two distinct stages, as shown in Algorithm 1.

In the first stage, pre-training is conducted for the initial two modules, updating the parameters of the encoders $\Psi_f$ and $\Psi_s$ using the contrastive losses $\mathcal{L}_c^f$ (Eq. (3)) and $\mathcal{L}_c^s$ (Eq. (4)). Regarding $\mathcal{L}_{c2l}$ in Eq. (11), its update is performed in conjunction with the contrastive loss functions due to its dependence on feature enhancement and its strong correlation with the data augmentation module. This stage's overall loss function can be expressed as follows:

$$\mathcal{L}_{pre} = \alpha\mathcal{L}_c^f + \beta\mathcal{L}_c^s + \mathcal{L}_{c2l}, \tag{15}$$

where the parameters $\alpha$ and $\beta$ serve to adjust the magnitude of enhancement for user features and user-item interactions. In particular, training data in this stage is exclusively sourced from warm users.

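The flag-based input composition of Eqs. (12)-(13) and the recommendation loss of Eq. (14) are equally mechanical. In the sketch below, Eq. (14) is read as softmax cross-entropy of the final hidden state against the item catalog, which is one plausible interpretation of the printed formula; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def build_input(flag: torch.Tensor, t_s_sim: torch.Tensor, t_s: torch.Tensor):
    """Eq. (13): select the simulated sequence embedding for cold-start users
    (flag = 1) and the real one for warm users (flag = 0)."""
    f = flag.view(-1, 1, 1).float()      # broadcast over (seq_len, dim)
    return f * t_s_sim + (1.0 - f) * t_s

def rec_loss(g_n: torch.Tensor, item_emb: torch.Tensor, target: torch.Tensor):
    """Eq. (14) as softmax cross-entropy over the item catalog: logits are
    similarities between the final hidden state g_n and every item embedding."""
    logits = g_n @ item_emb.t()          # (batch, |I|)
    return F.cross_entropy(logits, target)
```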

In the second stage, the initial two modules are incorporated into the sequential recommendation module and are subjected to fine-tuning. This phase primarily enhances the model's recommendation capabilities, utilizing $T_s^{u_k}$ as the warm user input, as discussed in Section 3.4, and $\widehat{T}_s'^{u_k}$ as the cold-start user input. The overall loss function for this stage can be expressed as:

$$\mathcal{L}_{fine} = \eta\mathcal{L}_{pre} + \mathcal{L}_{rec}, \tag{16}$$

where the parameter $\eta$ is utilized to regulate the fine-tuning magnitude for the first two modules. An excessively large parameter may lead to overfitting issues, diminishing the model's generalization capability. Conversely, an overly small parameter might result in insufficient expressive power, causing the simulated sequences to inadequately fit cold-start users and thereby reducing recommendation performance.

The comprehensive training procedure is delineated in Algorithm 1.

Algorithm 1 CMCLRec
Input: $F^u$ and $S^u$ for $u \in U$ ($flag = 0$), $F^u$ for $u \in U$ ($flag = 1$), learning rate $lr$, hyperparameters $\alpha$, $\beta$, $\lambda$, $\eta$.
Output: Global model parameters $\mathcal{W}$ (composed of $\mathcal{W}_{pre}$ and $\mathcal{W}_{rec}$).
1: Remove the user-item interactions of some users to simulate cold-start users.
2: Set $flag$ for all users by Eq. (12).
3: /* Self-supervised learning stage. */
4: for $i \leftarrow 1 : E_1$ (number of pre-training epochs) do
5:   for $j \leftarrow 1 : B_1$ (number of pre-training batches) do
6:     Calculate $\mathcal{L}_c^f$ for users ($flag = 0$) by Eq. (3).
7:     Calculate $\mathcal{L}_c^s$ for users ($flag = 0$) by Eq. (4).
8:     Calculate $\mathcal{L}_s$, $\mathcal{L}_f$ for users ($flag = 0$) by Eq. (8).
9:     Calculate $\mathcal{L}_{cross}$ for users ($flag = 0$) by Eq. (9).
10:    Set $\mathcal{L}_{pre}$ by Eq. (15).
11:    Apply the Adam optimizer to $\mathcal{L}_{pre}$.
12:    Perform back-propagation on $\mathcal{L}_{pre}$, obtaining gradients $\mathcal{G}$.
13:    Update $\mathcal{W}_{pre}$ based on $\mathcal{G}$.
14:  end for
15: end for
16: /* Recommendation training stage. */
17: for $i \leftarrow 1 : E_2$ (number of fine-tuning epochs) do
18:   for $j \leftarrow 1 : B_2$ (number of fine-tuning batches) do
19:     Calculate $\mathcal{L}_{pre}$ according to the previous stage.
20:     Get $X^{u_k}$ by Eq. (13).
21:     Calculate $\mathcal{L}_{rec}$ for all users by Eq. (14).
22:     Set $\mathcal{L}_{fine}$ by Eq. (16).
23:     Apply the Adam optimizer to $\mathcal{L}_{fine}$.
24:     Perform back-propagation on $\mathcal{L}_{fine}$, obtaining gradients $\mathcal{G}$.
25:     Update $\mathcal{W}_{pre}$ and $\mathcal{W}_{rec}$ based on $\mathcal{G}$.
26:   end for
27: end for
28: return $\mathcal{W}$
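Algorithm 1 amounts to a two-stage loop. The skeleton below assumes an illustrative model interface (pretrain_loss and rec_loss, bundling the losses of Eqs. (3), (4), (11), and (14)) and the default hyperparameters reported in Section 4.1; it should be read as a sketch, not the released training script.

```python
import torch

def train_cmclrec(model, warm_loader, full_loader, e1: int, e2: int,
                  lr: float = 1e-4, alpha: float = 0.5, beta: float = 0.5,
                  eta: float = 0.7):
    """Two-stage schedule of Algorithm 1. `model` is assumed to expose
    pretrain_loss(batch) -> (L_c^f, L_c^s, L_c2l) and rec_loss(batch);
    both method names are illustrative."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # Stage 1: self-supervised pre-training on warm users only (lines 3-15).
    for _ in range(e1):
        for batch in warm_loader:
            l_cf, l_cs, l_c2l = model.pretrain_loss(batch)
            loss = alpha * l_cf + beta * l_cs + l_c2l      # Eq. (15)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Stage 2: fine-tuning with the recommender on all users (lines 16-27).
    for _ in range(e2):
        for batch in full_loader:
            l_cf, l_cs, l_c2l = model.pretrain_loss(batch)
            l_pre = alpha * l_cf + beta * l_cs + l_c2l
            loss = eta * l_pre + model.rec_loss(batch)     # Eq. (16)
            opt.zero_grad()
            loss.backward()
            opt.step()
```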


4 EXPERIMENT

We conducted comparison experiments and ablation experiments on publicly available datasets to address the following three research questions:

• RQ1: Can CMCLRec achieve the best cold-start and overall recommendation performance compared to state-of-the-art cold-start solutions?
• RQ2: Does CMCLRec reduce the distribution gap between cold-start and warm users during the self-supervised learning phase?
• RQ3: Can integrating CMCLRec into a regular sequential recommendation model effectively improve recommendation performance for cold-start users, and what are the effects of the different key components?

4.1 Experimental Setup

Datasets. We conduct experiments to evaluate CMCLRec's performance on two publicly available datasets: KuaiRec [7] and XING [1]. KuaiRec constitutes a real-world dataset sourced from recommendation logs within the mobile video-sharing application Kuaishou, containing 1,141 users and 3,327 items with 4,676,570 user-item interactions. XING is a subset derived from the ACM RecSys Challenge 2017 dataset, comprising 106,881 users, 20,519 jobs, and 4,306,183 interactions. A 2,738-dimensional vector is employed to represent the job content, capturing diverse attributes including career level, tags, and supplementary information. For each dataset, 30% of users are partitioned into the test set, with 15% undergoing no adjustments and the remaining 15% having their user-item interactions removed to simulate cold-start users. The division of the remaining users into training and validation sets adheres to an 8:2 ratio.

Evaluation Metrics. We conduct separate evaluations for the overall performance, warm recommendation performance, and cold-start recommendation performance. The evaluation is performed using two widely used metrics, Recall@K and Normalized Discounted Cumulative Gain (NDCG@K). Higher Recall@K and NDCG@K indicate higher prediction accuracy and better ranking performance for the preferred items. Similar to [10, 18], K is set to 20 by default. For each metric, the results are averaged across all users and over five independent experiments.

Baselines. In this experiment, CMCLRec uses SASRec [11] as its recommendation model. In assessing the effectiveness of CMCLRec for recommending to both cold-start and warm users, we conducted comparisons with six cold-start recommendation models across the two datasets.

• DropoutNet [24] mitigates the cold-start problem by randomly discarding embeddings to reduce the model's dependence on user-item interactions.
• DeepMusic [21] employs a deep convolutional neural network to project users and items into a low-dimensional implicit space. Recommendations are then generated by assessing the positional relationships in this space.
• MeLU [12] predicts preferences for cold-start users based on consumed items, leveraging meta-learning, and strategically addresses the user cold-start issue through the Model-Agnostic Meta-Learning (MAML) approach.
• Heater [41] introduces a Mixture-of-Experts Transformation mechanism to enhance DropoutNet, providing "personalized" transformation functions.
• PDMA [26] enhances a preference learning decoupling framework using meta-augmentation to improve user cold-start recommendation.
• SDCRec [5] seamlessly models cold-start users as warm users using a social ensemble, without using additional user-item interaction records, to improve recommendations for cold-start users.

Implementation Details. We use the officially provided source code to implement the baselines. The embedding dimension is set to 64 for all baseline models and CMCLRec. We employ the Adam optimizer with a learning rate of $1 \times 10^{-4}$. For Eq. (11), the default value of $\lambda$ is set to 0.3. For Eq. (15), the default values of $\alpha$ and $\beta$ are both set to 0.5. For Eq. (16), the default value of $\eta$ is set to 0.7. In the interest of fairness, all baseline models are configured with the hyperparameters and designs specified in their respective articles.

4.2 Performance Comparison (RQ1)

The comparative evaluation of CMCLRec against the baseline models on the two datasets is presented in Table 1, encompassing assessments of overall recommendation performance, cold-start user recommendation performance, and warm user recommendation performance. To evaluate the effectiveness of CMCLRec, we conducted comparative experiments with a generative model (DeepMusic), two dropout models (DropoutNet, Heater), two meta-learning models (MeLU, PDMA), and a contrastive learning model (SDCRec). The improvement is calculated by comparing CMCLRec with the corresponding best baseline.

Table 1: Recommendation performance comparison against baselines. The improvements are calculated by comparing CMCLRec with the corresponding best baselines. Each group reports Recall@20 and NDCG@20.

            Overall (KuaiRec)  Overall (XING)   Cold-Start (KuaiRec)  Cold-Start (XING)  Warm (KuaiRec)   Warm (XING)
Method      Recall   NDCG     Recall   NDCG     Recall   NDCG         Recall   NDCG      Recall   NDCG    Recall   NDCG
DeepMusic   0.0375   0.0186   0.1876   0.1683   0.0394   0.0192       0.2691   0.1606    0.0414   0.0285  0.4205   0.2946
DropoutNet  0.0252   0.0092   0.1733   0.1507   0.0306   0.0143       0.2773   0.1953    0.0367   0.0271  0.3034   0.2182
MeLU        0.0351   0.0115   0.1829   0.1713   0.0418   0.0162       0.2842   0.2304    0.0392   0.0294  0.4116   0.3247
Heater      0.0392   0.0153   0.2095   0.1830   0.0463   0.0179       0.3271   0.2124    0.0449   0.0327  0.3982   0.2708
PDMA        0.0437   0.0176   0.2343   0.2157   0.0413   0.0209       0.3679   0.2416    0.0324   0.0319  0.3823   0.2886
SDCRec      0.0397   0.0209   0.2231   0.1973   0.0492   0.0187       0.3018   0.2037    0.0417   0.0348  0.3421   0.2947
CMCLRec     0.0481   0.0221   0.2476   0.2282   0.0563   0.0231       0.4116   0.2606    0.0484   0.0356  0.4556   0.3418
improv.     10.07%   5.74%    5.68%    5.80%    14.43%   10.53%       11.88%   7.86%     7.80%    2.30%   8.35%    5.27%

Upon scrutinizing the table, it is evident that in the overall recommendation scenario, CMCLRec exhibits superior performance compared to the other baseline models across both datasets. Specifically, on the NDCG@20 metric, CMCLRec achieved an improvement of 5.74% (KuaiRec) and 5.80% (XING) over the corresponding best baseline models. Moreover, on the Recall@20 metric, CMCLRec achieved an improvement of 10.07% (KuaiRec) and 5.68% (XING) over the corresponding best baseline models.

For the cold-start recommendation scene, CMCLRec exhibits more significant improvements over the baseline models. Specifically, compared to the corresponding best baseline, CMCLRec achieved an increase of 10.53% (KuaiRec) and 7.86% (XING) in NDCG@20, and an increase of 14.43% (KuaiRec) and 11.88% (XING) in Recall@20. For the warm recommendation scene, in comparison to the corresponding best baselines, CMCLRec showed an increase of 2.30% (KuaiRec) and 5.27% (XING) in NDCG@20, and an increase of 7.80% (KuaiRec) and 8.35% (XING) in Recall@20.

[Figure 2: Comparison between $\widehat{T}_s'^{u_k}$ (cold-start users, blue) and $\widehat{T}_s^{u_k}$ (warm users, red). (a) Distribution of $\widehat{T}_s'^{u_k}$ and $\widehat{T}_s^{u_k}$ before the self-supervised training. (b) Distribution of $\widehat{T}_s'^{u_k}$ and $\widehat{T}_s^{u_k}$ after the self-supervised training.]


Table 2: Results of the ablation study. Each group reports Recall@20 and NDCG@20.

              Overall (KuaiRec)  Overall (XING)   Cold-Start (KuaiRec)  Cold-Start (XING)  Warm (KuaiRec)   Warm (XING)
Method        Recall   NDCG     Recall   NDCG     Recall   NDCG         Recall   NDCG      Recall   NDCG    Recall   NDCG
CMCLRec-NoDA  0.0468   0.0220   0.2318   0.2204   0.0541   0.0218       0.4012   0.2421    0.0471   0.0312  0.4248   0.3257
CMCLRec-NoCL  0.0476   0.0217   0.2361   0.2143   0.0422   0.0162       0.3758   0.2270    0.0473   0.0334  0.4409   0.3215
CMCLRec-None  0.0437   0.0209   0.2242   0.1972   0.0417   0.0148       0.3596   0.2147    0.0451   0.0267  0.4185   0.2803
CMCLRec       0.0481   0.0221   0.2476   0.2282   0.0563   0.0231       0.4116   0.2606    0.0484   0.0356  0.4556   0.3418

In all scenes, CMCLRec outperforms the best baseline model, which is attributed to the success of cross-modal contrastive learning in establishing a mapping between user features and behavior sequences. This effectively improves the feature distribution of cold-start users. Simultaneously, for warm users, CMCLRec also enhances their behavior sequence embeddings effectively, resulting in a noticeable improvement in recommendation performance.

4.3 Visualization Study (RQ2)

In this section, we utilize t-SNE [22] to reduce the simulated behavior sequence embedding $\widehat{T}_s'^{u_k}$ and the behavior sequence embedding $\widehat{T}_s^{u_k}$ defined in Section 3.4 to 2 dimensions and conduct a visual analysis, as shown in Figure 2, to verify whether the disparity between cold-start users ($\widehat{T}_s'^{u_k}$) and warm users ($\widehat{T}_s^{u_k}$) has effectively been reduced.

From sub-figure (a), we observe that there is a significant distribution disparity between cold-start users and warm users before training, and CMCLRec shows almost no distinguishability for cold-start users. This phenomenon is attributed to the absence of historical behavioral sequences for cold-start users, thereby compelling the model, which heavily relies on such interactions, to treat all cold-start users uniformly. After self-supervised training, the distribution of cold-start users and warm users becomes nearly identical, and the improvement in the distribution of cold-start users is significant, as shown in sub-figure (b). This signifies that the model has gained distinctiveness for the post-training cold-start user data, generating different simulated behavior sequences for various cold-start users. Therefore, during self-supervised learning, CMCLRec effectively reduces the distribution gap between cold-start users and warm users, which confirms the rationality of using simulated behavioral sequences to recommend items to cold-start users.
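The Figure 2 study can be reproduced along these lines, assuming one pooled embedding vector per user and scikit-learn's t-SNE; the blue/red coding follows the figure caption.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embedding_gap(t_sim: np.ndarray, t_real: np.ndarray) -> None:
    """Project simulated (cold-start) and real (warm) sequence embeddings to
    2-D with t-SNE and overlay them, as in Figure 2."""
    pts = TSNE(n_components=2).fit_transform(np.concatenate([t_sim, t_real]))
    n = len(t_sim)
    plt.scatter(pts[:n, 0], pts[:n, 1], c="blue", s=5, label="cold-start (simulated)")
    plt.scatter(pts[n:, 0], pts[n:, 1], c="red", s=5, label="warm (real)")
    plt.legend()
    plt.show()
```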
4.4 Ablation Study (RQ3)

In comparison to traditional sequential recommendation models, CMCLRec incorporates additional data augmentation and cross-modal contrastive learning modules. Through the ablation study, we investigate the effectiveness of these two modules and design the following three variant models for comparison based on CMCLRec.

• CMCLRec-NoDA: removes the data augmentation module and directly employs the original embeddings for cross-modal contrastive learning.
• CMCLRec-NoCL: removes the cross-modal contrastive learning module and directly applies a linear mapping to the augmented data.
• CMCLRec-None: removes both the data augmentation module and the cross-modal contrastive learning module and performs a direct linear mapping on the original embeddings.

The observations derived from the data presented in Table 2 are summarized as follows. Firstly, the removal of the cross-modal contrastive learning module results in a notable decline in recommendation performance in the cold-start recommendation scene. This is because CMCLRec-NoCL directly uses a linear mapping to generate simulated behavior sequences for cold-start users, which deviate substantially from their actual preferences, resulting in poor recommendation performance. Secondly, omitting the data augmentation module results in a reduction in recommendation effectiveness across all scenes. This is because CMCLRec-NoDA directly models the original data embeddings, which inadequately captures the preferences of users with fewer features, leading to decreased recommendation performance. Thirdly, upon removing both modules, CMCLRec-None experiences a further decrease in recommendation accuracy compared to CMCLRec-NoDA and CMCLRec-NoCL. This suggests that both modules have a positive impact on user recommendations in all scenes.

5 CONCLUSION

In this paper, we propose CMCLRec, which addresses the cold-start problem in sequential recommendation by generating simulated behavior sequences for cold-start users. CMCLRec leverages the data of warm users and employs the cross-modal contrastive learning method to construct a mapping from user features to behavior sequences. Since the embeddings of simulated behavior sequences can be directly used as input to conventional sequential recommendation models for cold-start users, CMCLRec can be directly embedded into specially designed sequential recommendation models to improve their recommendation performance for cold-start users. Experimental results based on two publicly available datasets demonstrate that CMCLRec outperforms state-of-the-art baseline models in both cold-start and warm recommendation scenes. Besides, the ablation study confirms that CMCLRec effectively enhances the recommendation performance of sequential recommendation models for cold-start users.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China under Grants No. 92267104 and 62372242.


REFERENCES
[1] Fabian Abel, Yashar Deldjoo, Mehdi Elahi, and Daniel Kohlsdorf. 2017. RecSys Challenge 2017: Offline and online evaluation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. 372-373.
[2] Veselka Boeva and Christian Nordahl. 2019. Modeling evolving user behavior via sequential clustering. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 12-20.
[3] Hao Chen, Zefan Wang, Feiran Huang, Xiao Huang, Yue Xu, Yishi Lin, Peng He, and Zhoujun Li. 2022. Generative adversarial framework for cold-start item recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2565-2571.
[4] Shi Dong, Ping Wang, and Khushnood Abbas. 2021. A survey on deep learning and its applications. Computer Science Review 40 (2021), 100379.
[5] Jing Du, Zesheng Ye, Lina Yao, Bin Guo, and Zhiwen Yu. 2022. Socially-aware dual contrastive learning for cold-start recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1927-1932.
[6] Hui Fang, Danning Zhang, Yiheng Shu, and Guibing Guo. 2020. Deep learning for sequential recommendation: Algorithms, influential factors, and evaluations. ACM Transactions on Information Systems (TOIS) 39, 1 (2020), 1-42.
[7] Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. 2022. KuaiRec: A fully-observed dataset and insights for evaluating recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 540-550.
[8] Ruining He, Wang-Cheng Kang, Julian J. McAuley, et al. 2018. Translation-based recommendation: A scalable method for modeling sequential behavior. In IJCAI. 5264-5268.
[9] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[10] Feiran Huang, Zefan Wang, Xiao Huang, Yufeng Qian, Zhetao Li, and Hao Chen. 2023. Aligning distillation for cold-start item recommendation. (2023).
[11] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197-206.
[12] Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1073-1082.
[13] Weiming Liu, Xiaolin Zheng, Jiajie Su, Longfei Zheng, Chaochao Chen, and Mengling Hu. 2023. Contrastive proxy kernel Stein path alignment for cross-domain cold-start recommendation. IEEE Transactions on Knowledge and Data Engineering (2023).
[14] Feiyang Pan, Shuokai Li, Xiang Ao, Pingzhong Tang, and Qing He. 2019. Warm up cold-start advertisements: Improving CTR predictions via learning to learn ID embeddings. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 695-704.
[15] Xingyu Pan, Yushuo Chen, Changxin Tian, Zihan Lin, Jinpeng Wang, He Hu, and Wayne Xin Zhao. 2022. Multimodal meta-learning for cold-start sequential recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3421-3430.
[16] Xiuyuan Qin, Huanhuan Yuan, Pengpeng Zhao, Junhua Fang, Fuzhen Zhuang, Guanfeng Liu, Yanchi Liu, and Victor Sheng. 2023. Meta-optimized contrastive learning for sequential recommendation. arXiv preprint arXiv:2304.07763 (2023).
[17] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized Markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web. 811-820.
[18] Walid Shalaby, Sejoon Oh, Amir Afsharinejad, Srijan Kumar, and Xiquan Cui. 2022. M2TRec: Metadata-aware multi-task Transformer for large-scale and cold-start free session-based recommendations. In Proceedings of the 16th ACM Conference on Recommender Systems. 573-578.
[19] Yue Shi, Martha Larson, and Alan Hanjalic. 2014. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR) 47, 1 (2014), 1-45.
[20] Jiaxi Tang and Ke Wang. 2018. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 565-573.
[21] Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. Advances in Neural Information Processing Systems 26 (2013).
[22] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008).
[23] Manasi Vartak, Arvind Thiagarajan, Conrado Miranda, Jeshua Bratman, and Hugo Larochelle. 2017. A meta-learning perspective on cold-start recommendations for items. Advances in Neural Information Processing Systems 30 (2017).
[24] Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. 2017. DropoutNet: Addressing cold start in recommender systems. Advances in Neural Information Processing Systems 30 (2017).
[25] Chunyang Wang, Yanmin Zhu, Haobing Liu, Tianzi Zang, Jiadi Yu, and Feilong Tang. 2022. Deep meta-learning in recommendation systems: A survey. arXiv preprint arXiv:2206.04415 (2022).
[26] Chunyang Wang, Yanmin Zhu, Aixin Sun, Zhaobo Wang, and Ke Wang. 2023. A preference learning decoupling framework for user cold-start recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1168-1177.
[27] Jianling Wang, Kaize Ding, and James Caverlee. 2021. Sequential recommendation for cold-start users with meta transitional learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1783-1787.
[28] Shoujin Wang, Liang Hu, Yan Wang, Longbing Cao, Quan Z. Sheng, and Mehmet Orgun. 2019. Sequential recommender systems: Challenges, progress and prospects. arXiv preprint arXiv:2001.04830 (2019).
[29] Xiao Wang and Guo-Jun Qi. 2022. Contrastive learning with stronger augmentations. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 5 (2022), 5549-5560.
[30] Jian Wei, Jianhua He, Kai Chen, Yi Zhou, and Zuoyin Tang. 2016. Collaborative filtering and deep learning based hybrid recommendation for cold start problem. In 2016 IEEE 14th Intl Conf on Dependable, Autonomic and Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing, 2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE, 874-877.
[31] Yinwei Wei, Xiang Wang, Qi Li, Liqiang Nie, Yan Li, Xuanping Li, and Tat-Seng Chua. 2021. Contrastive learning for cold-start recommendation. In Proceedings of the 29th ACM International Conference on Multimedia. 5382-5390.
[32] Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2019. Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 346-353.
[33] Jie Xu, Tianwei Xing, and Mihaela Van Der Schaar. 2016. Personalized course sequence recommendations. IEEE Transactions on Signal Processing 64, 20 (2016), 5340-5352.
[34] Tiansheng Yao, Xinyang Yi, Derek Zhiyuan Cheng, Felix Yu, Ting Chen, Aditya Menon, Lichan Hong, Ed H. Chi, Steve Tjoa, Jieqi Kang, et al. 2021. Self-supervised learning for large-scale item recommendations. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4321-4330.
[35] Rong Ye, Mingxuan Wang, and Lei Li. 2022. Cross-modal contrastive learning for speech translation. arXiv preprint arXiv:2205.02444 (2022).
[36] Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi. 2019. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems. 269-277.
[37] Eva Zangerle and Christine Bauer. 2022. Evaluating recommender systems: Survey and framework. ACM Computing Surveys 55, 8 (2022), 1-38.
[38] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. 2021. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 833-842.
[39] Yu Zhu, Jinghao Lin, Shibi He, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2019. Addressing the item cold-start problem by attribute-driven active learning. IEEE Transactions on Knowledge and Data Engineering 32, 4 (2019), 631-644.
[40] Yongchun Zhu, Ruobing Xie, Fuzhen Zhuang, Kaikai Ge, Ying Sun, Xu Zhang, Leyu Lin, and Juan Cao. 2021. Learning to warm up cold item embeddings for cold-start recommendation with meta scaling and shifting networks. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1167-1176.
[41] Ziwei Zhu, Shahin Sefati, Parsa Saadatpanah, and James Caverlee. 2020. Recommendation for new users and new items via randomized training and mixture-of-experts transformation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1121-1130.
[42] Mohammadreza Zolfaghari, Yi Zhu, Peter Gehler, and Thomas Brox. 2021. CrossCLR: Cross-modal contrastive learning for multi-modal video representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1450-1459.
