Spatio-Temporal Transformer Recommender: Next Location Recommendation with Attention Mechanism by Mining the Spatio-Temporal Relationship between Visited Locations
Round 1
Reviewer 1 Report
It is an interesting topic, as already mentioned in the body of the text. There are several applications for recommending the next POI to different users based upon historical spatio-temporal data. It is advantageous to consider non-contiguous visits in the model.
Besides the performance evaluation, it would be great to add a section that showcases some selected users and, for specified time-framed trajectories, outputs the next geographic POIs that would be offered based on your model.
It would be nice to add some extra information on how the adjacency (in terms of distance) of previously visited POIs may impact the results for recommending the next POIs, with some examples.
Author Response
It is an interesting topic, as already mentioned in the body of the text. There are several applications for recommending the next POI to different users based upon historical spatio-temporal data. It is advantageous to consider non-contiguous visits in the model.
Besides the performance evaluation, it would be great to add a section that showcases some selected users and, for specified time-framed trajectories, outputs the next geographic POIs that would be offered based on your model.
Response 1:
Thanks for your suggestion. We added a new figure (Figure 6) and some explanations to show the results of our model's recommendations for one user. Currently, our model can recommend the location of the next visit. In addition, our model could output the user's next geographic POIs for a specified time-framed trajectory if more travel information, such as weather and transportation, were available. Please see revisions between Pg. 12, Line 446 – Line 448 in the clean version.
It would be nice to add some extra information on how the adjacency (in terms of distance) of previously visited POIs may impact the results for recommending the next POIs, with some examples.
Response 2:
Thanks for your suggestion. We added a new figure (Figure 1) and some explanations in the manuscript to explain how the adjacency (in terms of distance) of previously visited POIs may impact the results. Please see revisions between Pg. 3, Line 87 – Line 94 and Pg. 4, Line 99 – Line 101 in the clean version.
Reviewer 2 Report
This paper proposes a Transformer-based model, STTF-Recommender, for Next Location Recommendation (NLR). Thanks to the multi-head attention mechanism in the Transformer, this model can better extract the spatio-temporal correlation information between discontinuous visit points. Thus, STTF-Recommender can model both short-range and long-range preferences of users at POI points. The application of the Transformer, which is very popular in natural language processing, to NLR is a relatively novel research topic. Compared with RNNs and other models, the Transformer model has a significant advantage in sequential recommendations that account for spatio-temporal information. However, the paper needs more innovation, such as making special improvements to the Transformer to make it more suitable for NLR; there is existing literature that has applied the Transformer to NLR with exploratory model improvements [1] and can be used for comparison. Additionally, there are some improvements that could be made to the article, as follows:
1. Some numerical indicators should be used at the end of the abstract to demonstrate the strengths of the proposed model, rather than just describing it qualitatively.
2. In the article contributions at the end of the introduction, consider moving the second-to-last point to the end of the abstract and the last point to the body of the introduction.
3. Please correct the title of section 2.1, the current title is "Subsection".
4. There are several inconsistencies in the symbols of variables in the article. Please carefully check the consistency of the symbols in Section 3, Preliminaries. In addition, several variables lack explanation, such as r_m in tra_ui in line 203, and the nd and st in Subsection 5.1.1, Datasets.
5. Please optimize Figure 1; the information contained in this model framework is obscure. For example, the meaning of the arrow from Matching Loss to Spatio-temporal Embedding is unclear.
6. In 4.1 Spatio-Temporal Embedding Layer, the authors directly sum the representations of user u, location p and time t to obtain the representation of this check-in record. Does this approach retain the respective location and time information well enough to support the subsequent modelling of the spatio-temporal relationship of the check-in history? The authors should compare various representation calculation methods to improve the persuasiveness of the method design.
7. In Section 4.3, Output Layer, the authors use the Attention matcher, based on the attention mechanism, to select the optimal candidate positions. However, the presence or absence of … in Eq. (6) does not affect the result of this equation, so the Attention matcher can be degraded to a combination of dot product and softmax. In addition, the authors need to clarify the behavior of Sum in Eq. (6); although it can be roughly inferred from a thorough reading that it is a summation by rows, it sets an obstacle to the reader's fluency.
8. In Section 5, Performance Evaluation, the authors only use the recall rate as the evaluation metric. The evaluation criteria are not sufficiently comprehensive, and the authors should consider adding Precision, F1, and other indicators to evaluate the models.
References
[1] S. Halder, K. H. Lim, J. Chan, and X. Zhang, “Transformer-based multi-task learning for queuing time aware next POI recommendation,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2021, pp. 510–523.
Author Response
This paper proposes a Transformer-based model, STTF-Recommender, for Next Location Recommendation (NLR). Thanks to the multi-head attention mechanism in the Transformer, this model can better extract the spatio-temporal correlation information between discontinuous visit points. Thus, STTF-Recommender can model both short-range and long-range preferences of users at POI points. The application of the Transformer, which is very popular in natural language processing, to NLR is a relatively novel research topic. Compared with RNNs and other models, the Transformer model has a significant advantage in sequential recommendations that account for spatio-temporal information. However, the paper needs more innovation, such as making special improvements to the Transformer to make it more suitable for NLR; there is existing literature that has applied the Transformer to NLR with exploratory model improvements [1] and can be used for comparison. Additionally, there are some improvements that could be made to the article, as follows:
- Some numerical indicators should be used at the end of the abstract to demonstrate the strengths of the proposed model, rather than just describing it qualitatively.
Response 1:
Thanks for your suggestion. This is indeed a great point, and we have made the change. Please see revisions between Pg. 1, Line 27 – Line 29 in the clean version.
- In the article contributions at the end of the introduction, consider moving the second-to-last point to the end of the abstract and the last point to the body of the introduction.
Response 2:
Thanks for your suggestion. These are really good ideas, and we have made the changes accordingly. Please see revisions between Pg. 3, Line 117 – Line 119 in the clean version.
- Please correct the title of section 2.1, the current title is "Subsection".
Response 3:
Thanks for your suggestion. We have corrected our mistake and changed "Subsection" to "Sequential recommendation". Please see the revision at Pg. 4, Line 143 in the clean version.
- There are several inconsistencies in the symbols of variables in the article. Please carefully check the consistency of the symbols in Section 3, Preliminaries. In addition, several variables lack explanation, such as r_m in tra_ui in line 203, and the nd and st in Subsection 5.1.1, Datasets.
Response 4:
Thanks for your suggestion. These were our mistakes. We have corrected them and added explanations for the symbols in Section 3. Please see revisions between Pg. 5, Line 214 – Line 227 and Pg. 9, Line 347 – Line 350 in the clean version.
- Please optimize Figure 1; the information contained in this model framework is obscure. For example, the meaning of the arrow from Matching Loss to Spatio-temporal Embedding is unclear.
Response 5:
Thanks for your suggestion. We have optimized this figure (it is Figure 2 in the new version) and removed the arrow from Matching Loss to Spatio-temporal Embedding. Please see revisions between Pg. 6, Line 243 – Line 245 in the clean version.
- In 4.1 Spatio-Temporal Embedding Layer, the authors directly sum the representations of user u, location p and time t to obtain the representation of this check-in record. Does this approach retain the respective location and time information well enough to support the subsequent modelling of the spatio-temporal relationship of the check-in history? The authors should compare various representation calculation methods to improve the persuasiveness of the method design.
Response 6:
Thanks for your suggestion. Our model retains the respective location and time information well enough to support the subsequent modelling of the spatio-temporal relationship of the check-in history. This is because our model is based on the STAN model [1], which provides a spatio-temporal trajectory embedding method. Therefore, our model can incorporate time into the location prediction and make good use of these spatio-temporal relationships. A minimal sketch of this additive embedding idea is given below.
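To make the additive trajectory embedding concrete, here is a minimal PyTorch sketch. The vocabulary sizes, embedding dimension, and the function name `embed_checkin` are illustrative assumptions, not the authors' implementation; it only shows the idea of summing the user, location, and time representations:

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen for illustration only.
NUM_USERS, NUM_LOCATIONS, NUM_TIME_SLOTS, DIM = 1000, 5000, 24 * 7, 64

user_emb = nn.Embedding(NUM_USERS, DIM)
loc_emb = nn.Embedding(NUM_LOCATIONS, DIM)
time_emb = nn.Embedding(NUM_TIME_SLOTS, DIM)  # e.g., hour-of-week slots

def embed_checkin(u, p, t):
    """Represent a check-in (u, p, t) as the sum of its three embeddings."""
    return user_emb(u) + loc_emb(p) + time_emb(t)

# One check-in: user 42 visits location 17 during time slot 5.
e = embed_checkin(torch.tensor([42]), torch.tensor([17]), torch.tensor([5]))
print(e.shape)  # torch.Size([1, 64])
```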
- In Section 4.3, Output Layer, the authors use the Attention matcher, based on the attention mechanism, to select the optimal candidate positions. However, the presence or absence of … in Eq. (6) does not affect the result of this equation, so the Attention matcher can be degraded to a combination of dot product and softmax. In addition, the authors need to clarify the behavior of Sum in Eq. (6); although it can be roughly inferred from a thorough reading that it is a summation by rows, it sets an obstacle to the reader's fluency.
Response 7:
Thanks for your suggestion. In fact, the Attention matcher includes the dot product and the softmax, as shown in Eq. (5) and Eq. (6). We wrote Eq. (6) in this way because we were inspired by the attention mechanism [1] and wanted the formal expression to be consistent, which is why we split it into Eq. (5) and Eq. (6). The Sum operation is a weighted sum over the last dimension, converting the dimension of A(u). We have added the relevant formula description. Please see revisions between Pg. 8, Line 315 – Line 323 in the clean version.
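For readers who want to trace the two equations, the following sketch shows one plausible reading of the matcher: a dot product between candidate embeddings and the encoded sequence, a softmax, and a weighted sum over the last dimension. The shapes and the exact weighting here are assumptions for illustration; Eq. (5) and Eq. (6) in the manuscript remain the authoritative definition:

```python
import torch

def attention_matcher(seq_out, cand_emb):
    """Score candidate locations against the encoded check-in sequence.

    seq_out:  (m, d) encoded representations of the m visited check-ins.
    cand_emb: (L, d) embeddings of the L candidate locations.
    Returns:  (L,)   one matching score per candidate location.
    """
    scores = cand_emb @ seq_out.T            # dot product, shape (L, m)
    weights = torch.softmax(scores, dim=-1)  # normalize over the sequence
    # Weighted sum over the last dimension collapses (L, m) to (L,),
    # i.e., the "Sum" step that converts the dimension of A(u).
    return (weights * scores).sum(dim=-1)

scores = attention_matcher(torch.randn(10, 64), torch.randn(100, 64))
print(scores.shape)  # torch.Size([100])
```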
- In Section 5, Performance Evaluation, the authors only use the recall rate as the evaluation metric. The evaluation criteria are not sufficiently comprehensive, and the authors should consider adding Precision, F1, and other indicators to evaluate the models.
References
[1] S. Halder, K. H. Lim, J. Chan, and X. Zhang, “Transformer-based multi-task learning for queuing time aware next POI recommendation,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2021, pp. 510–523.
Response 8:
Thanks for the advice. These metrics are indeed very important evaluation metrics in many cases, and thank you for the paper you recommended. That paper also uses these metrics: Precision@k, Recall@k, and F1@k. However, as you can see from Figure 3 in TLR-M, these metrics effectively repeat one another: in our scenario, TP+FP is k times as much as TP+FN, so Precision@k equals Recall@k / k, and F1@k follows from both. Our experimental results are consistent with those in the paper you recommended. Considering the redundancy, we did not list these repeated results, as shown in Figure 3 of TLR-M.
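As a small illustration of that redundancy, the sketch below computes the three metrics for a single test case with exactly one ground-truth item; the function name and the example data are hypothetical:

```python
def metrics_at_k(recommended, ground_truth, k):
    """Precision@k, Recall@k, and F1@k with exactly one relevant item."""
    hit = 1 if ground_truth in recommended[:k] else 0
    precision = hit / k    # TP / (TP + FP): k items are recommended
    recall = hit / 1.0     # TP / (TP + FN): only one item is relevant
    f1 = 2 * precision * recall / (precision + recall) if hit else 0.0
    return precision, recall, f1

# With one positive per test case, Precision@k = Recall@k / k, so all
# three curves carry the same information up to a constant factor.
print(metrics_at_k(recommended=[3, 7, 9, 2, 5], ground_truth=7, k=5))
# (0.2, 1.0, 0.333...)
```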
References
[1] Y. Luo, Q. Liu, and Z. Liu, “STAN: Spatio-temporal attention network for next location recommendation,” 2021.
Reviewer 3 Report
In this paper, the authors developed a multi-layer spatio-temporal deep learning attention model for POI recommendation by mining the spatio-temporal relationship between visited locations. The topic fits the scope of the journal well, and the evaluation results indicate better performance of the developed method. Overall, the paper is well organized. Before the paper is accepted for publication, the following issues should be addressed to improve the manuscript.
1. In lines 69 and 89, the authors keep emphasizing that previous studies ignore discrete-visit data. The proper reasons and/or technical difficulties behind this issue should be explained here.
2. Line 202: “lk, tk is the timestamp and location of” should be “… is the location and timestamp”; please correct. Also, in line 199, location was denoted as p, but here the authors use l; the symbols need to be unified.
3. Line 203: what does r stand for? Meanwhile, regarding “Users with too few check-ins are discarded”: what threshold is applied in this paper to filter out users with limited records? This needs to be clarified in more detail.
4. Section 4.1: the authors only detail how the timestamp is encoded, without providing any description of the encoders for user and location. More details need to be provided.
5. Section 5.1.2: the authors selected a set of models for comparison to evaluate the performance of the developed model. In addition to these benchmark models, the authors should also perform ablation experiments with the current model by incorporating (or not) the discrete-visit data, to justify the superiority of including discrete samples, since the authors have emphasized this many times in the literature review section.
Author Response
In this paper, the authors developed a multi-layer spatio-temporal deep learning attention model for POI recommendation by mining the spatio-temporal relationship between visited locations. The topic fits the scope of the journal well, and the evaluation results indicate better performance of the developed method. Overall, the paper is well organized. Before the paper is accepted for publication, the following issues should be addressed to improve the manuscript.
- In lines 69 and 89, the authors keep emphasizing that previous studies ignore discrete-visit data. The proper reasons and/or technical difficulties behind this issue should be explained here.
Response 1:
Thank you for your suggestion; your point is quite reasonable. The main difficulty behind this problem is that most deep learning-based approaches for modeling user preferences suffer from the drawback of being unable to model the relations between two non-consecutive POIs, as they can only model consecutive activities in the user's check-in sequence. Please see revisions between Pg. 2, Line 77 – Line 80 in the clean version.
- Line 202: “lk, tk is the timestamp and location of” should be “… is the location and timestamp”; please correct. Also, in line 199, location was denoted as p, but here the authors use l; the symbols need to be unified.
Response 2:
Thanks for your suggestion. These were our mistakes, and we have now corrected them. Please see the revisions at Pg. 6, Line 217 and Line 218 in the clean version.
- Line 203: what does r stand for? Meanwhile, regarding “Users with too few check-ins are discarded”: what threshold is applied in this paper to filter out users with limited records? This needs to be clarified in more detail.
Response 3:
Thanks for your suggestion.
(1) We have corrected and supplemented the symbols. The trajectory of user u_i is a temporally ordered sequence of check-ins, and each check-in r_k of user u_i is represented as a tuple {u_i, p_k, t_k}. Please see revisions between Pg. 5, Line 217 – Line 220 in the clean version.
(2) Thank you for the reminder. We changed “Users with too few check-ins are discarded” to “Inactive users with too few check-ins (less than 10) are discarded”. Please see revisions between Pg. 5, Line 220 – Line 221 in the clean version. A minimal sketch of this filtering step is given below.
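As a concrete illustration of this preprocessing rule, here is a minimal sketch. The tuple layout and the function name are assumptions for illustration; only the threshold of 10 is taken from the revised text:

```python
from collections import Counter

MIN_CHECKINS = 10  # threshold stated in the revised manuscript

def filter_inactive_users(checkins):
    """Keep check-ins (u, p, t) only for users with >= MIN_CHECKINS records."""
    counts = Counter(u for u, _, _ in checkins)
    return [r for r in checkins if counts[r[0]] >= MIN_CHECKINS]

checkins = [("u1", "p3", 1_640_000_000 + i) for i in range(12)] \
         + [("u2", "p5", 1_640_003_600)] * 3
print(len(filter_inactive_users(checkins)))  # 12 -- u2's 3 check-ins dropped
```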
- Section 4.1: the authors only detail how the timestamp is encoded, without providing any description of the encoders for user and location. More details need to be provided.
Response 4:
Thanks for your suggestion. The user and location are already encoded in the dataset, and we have added descriptions of this part. Please see revisions between Pg. 6, Line 247 – Line 252 in the clean version.
- Section 5.1.2: the authors selected a set of models for comparison to evaluate the performance of the developed model. In addition to these benchmark models, the authors should also perform ablation experiments with the current model by incorporating (or not) the discrete-visit data, to justify the superiority of including discrete samples, since the authors have emphasized this many times in the literature review section.
Response 5:
Thanks for your suggestion. To justify the superiority of including discrete samples, we added an ablation experiment and updated the results in Figure 1; the results are consistent with the expected analysis. Please see revisions between Pg. 10, Line 401 – Line 403, Pg. 11, Line 412 – Line 414 and Pg. 11, Line 417 – Line 419 in the clean version.
Reviewer 4 Report
The authors present a new approach based on deep learning for a recommender system aiming at proposing a good next point of interest to a user in a location-based social network. To do so, they propose an architecture that relies on attention mechanisms and on the well-known transformer module.
The introduction of the paper is well-written and well-justified. The authors elegantly present their problem and justify it quite convincingly.
The related work needs some improvement. There are mistakes throughout the text (some noted below). The last paragraph contains no reference. It doesn’t seem to belong in the related work section.
Section 3 needs to be corrected. They use both p and l for location. The symbol r is undefined. Also, they say the trajectory is a list of check-ins (which are tuples), but in the last two lines it is simply a list of t. Finally, the goal is to find p at r_{m+1}, but that is also unclear, since m is used for t, and with r it was m_i.
In Section 4, the first figure should be reworked; it is difficult to understand how the model works from it. At the beginning of 4.1, the authors say they rely on STAN but do not explain it. A paper should be readable standalone, so please explain the part you used from STAN; the explanation that follows makes very little sense to a reader unfamiliar with that work. I am not sure why they introduce another symbol (e) for each input, especially since it seems they are simply doing a concat operation; it would have been more interesting to explain the embedding operation. A similar thing occurs in 4.2: symbols are undefined or poorly defined. The authors have successfully made something simple look utterly complex. Yet much important information is missing. For instance, they apply a “Position-wise Feed-Forward Network” but fail to mention the number of units/layers; the same goes for the dropout, with no mention of its configuration. Overall, Section 4 needs to be improved for the paper to be publishable. I also suggest adding the complete architecture to the section.
In Section 5, many references are not showing correctly. I read this section a bit more quickly, since it was clear from Section 4 that the paper needed at least a major revision, but the results seemed robust enough and of good quality. I would have liked to see metrics other than just Recall, to be sure nothing else was going on.
Nevertheless, I think the paper has good potential, but the model is poorly explained. Moreover, the paper contains several mistakes and inconsistencies that need to be addressed. I suggest a rejection or a major revision. I am willing to read an improved version.
Here are some random things I noted (not exhaustive):
Line 128 – the title is missing
Line 147 – a period is missing
Line 177 – if you use an acronym, especially for a conference, you have to define it beforehand
Line 301 – the sentence makes no sense; it should be reformulated
Line 304 – What is L?
Line 306 – the last sentence needs to be corrected
Author Response
The authors present a new approach based on deep learning for a recommender system aiming at proposing a good next point of interest to a user in a location-based social network. To do so, they propose an architecture that relies on attention mechanisms and on the well-known transformer module.
The introduction of the paper is well-written and well-justified. The authors elegantly present their problem and justify it quite convincingly.
The related work needs some improvement. There are mistakes throughout the text (some noted below). The last paragraph contains no reference. It doesn’t seem to belong in the related work section.
Response 1:
Thanks for your suggestion. We have added some relevant references in Section 2 and corrected these mistakes. Thank you for the reminder.
Section 3 needs to be corrected. They use both p and l for location. The symbol r is undefined. Also, they say the trajectory is a list of check-ins (which are tuples), but in the last two lines it is simply a list of t. Finally, the goal is to find p at r_{m+1}, but that is also unclear, since m is used for t, and with r it was m_i.
Response 2:
Thanks for your suggestion. These were our mistakes. We have corrected them and added explanations for the symbols in Section 3. Please see revisions between Pg. 5, Line 214 – Line 227 in the clean version.
In Section 4, the first figure should be reworked; it is difficult to understand how the model works from it. At the beginning of 4.1, the authors say they rely on STAN but do not explain it. A paper should be readable standalone, so please explain the part you used from STAN; the explanation that follows makes very little sense to a reader unfamiliar with that work. I am not sure why they introduce another symbol (e) for each input, especially since it seems they are simply doing a concat operation; it would have been more interesting to explain the embedding operation. A similar thing occurs in 4.2: symbols are undefined or poorly defined. The authors have successfully made something simple look utterly complex. Yet much important information is missing. For instance, they apply a “Position-wise Feed-Forward Network” but fail to mention the number of units/layers; the same goes for the dropout, with no mention of its configuration. Overall, Section 4 needs to be improved for the paper to be publishable. I also suggest adding the complete architecture to the section.
Response 3:
Thanks for your suggestion.
(1) We have optimized this figure (it is Figure 2 in the new version). Please see revisions between Pg. 6, Line 243 – Line 245 in the clean version.
(2) The multi-modal embedding module of STAN consists of two parts: a trajectory embedding layer and a spatio-temporal embedding layer. The trajectory embedding layer is a multi-modal embedding layer used to encode user, location, and time into latent representations. The spatio-temporal embedding layer is a unit embedding layer used for the dense representation of spatial and temporal differences, with an hour and a hundred meters as the basic units, respectively. Inspired by the trajectory embedding layer in STAN, we carry out trajectory embedding, which can learn dense representations of user, location, time, and the spatio-temporal effect (a minimal sketch of the unit-embedding idea follows this response).
(3) Referring to the STAN model [2], we introduced the symbol e to describe the latent representations. Please see revisions between Pg. 6, Line 247 – Line 252 in the clean version.
(4) These were our mistakes. We have corrected and supplemented the symbols, with the details added in Section 5.1.3. Please see revisions between Pg. 9, Line 368 – Line 370 in the clean version.
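The sketch below illustrates the unit-embedding idea described in point (2): time and distance gaps between check-ins are expressed in hour and hundred-meter units and then embedded. The clipping bounds, the dimensions, and the use of simple discretization (STAN itself interpolates between adjacent unit embeddings) are assumptions for illustration:

```python
import torch
import torch.nn as nn

HOUR_SECONDS, HUNDRED_METERS = 3600.0, 100.0  # basic units, as above
MAX_DT, MAX_DS, DIM = 24 * 7, 500, 64         # hypothetical clipping bounds

dt_emb = nn.Embedding(MAX_DT + 1, DIM)  # temporal-difference units
ds_emb = nn.Embedding(MAX_DS + 1, DIM)  # spatial-difference units

def spatio_temporal_diff_embedding(dt_seconds, ds_meters):
    """Embed the time/space gap between two check-ins as discrete units."""
    dt_units = torch.clamp((dt_seconds / HOUR_SECONDS).long(), max=MAX_DT)
    ds_units = torch.clamp((ds_meters / HUNDRED_METERS).long(), max=MAX_DS)
    return dt_emb(dt_units) + ds_emb(ds_units)

# A gap of 2 hours and 350 meters between two check-ins.
e = spatio_temporal_diff_embedding(torch.tensor([7200.0]),
                                   torch.tensor([350.0]))
print(e.shape)  # torch.Size([1, 64])
```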
In Section 5, many references are not showing correctly. I read this section a bit more quickly, since it was clear from Section 4 that the paper needed at least a major revision, but the results seemed robust enough and of good quality. I would have liked to see metrics other than just Recall, to be sure nothing else was going on.
Response 4:
Thanks for your suggestion. Recall is the most commonly used metric for NLR. Having referred to TLR-M [1], we also considered using Precision and F1 as metrics, but we only use one positive sample per test case, so TP+FP is k times as much as TP+FN in our application and Precision@k and F1@k repeat the information in Recall@k. Considering the redundancy, we did not list these repeated results, which are shown in Figure 3 of TLR-M.
Nevertheless, I think the paper has good potential, but the model is poorly explained. Moreover, the paper contains several mistakes and inconsistencies that need to be addressed. I suggest a rejection or a major revision. I am willing to read an improved version.
Here are some random things I noted (not exhaustive):
Line 128 – the title is missing
Line 147 – a period is missing
Line 177 – if you use an acronym, especially for a conference, you have to define it beforehand
Line 301 – the sentence makes no sense; it should be reformulated
Line 304 – What is L?
Line 306 – the last sentence needs to be corrected
Response 5:
We are very sorry for our negligence regarding these mistakes and inconsistencies. We have tried our best to correct them throughout the manuscript, and we are very grateful to you for reviewing the paper so carefully. As for the details, please see revisions at Pg. 4, Lines 143 and 162, Pg. 5, Lines 192 – 193, Pg. 8, Lines 324 – 325, Pg. 8, Line 326, and Pg. 8, Lines 327 – 328 in the clean version.
References
[1] S. Halder, K. H. Lim, J. Chan, and X. Zhang, “Transformer-based multi-task learning for queuing time aware next POI recommendation,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2021, pp. 510–523.
[2] Y. Luo, Q. Liu, and Z. Liu, “STAN: Spatio-temporal attention network for next location recommendation,” 2021.
Round 2
Reviewer 2 Report
This version is a significant improvement over the previous version, but there are still some points that could be optimized:
1. There are two figures numbered Figure 5; the second Figure 5 is not mentioned in the text.
2. The STAN model cited in this manuscript also considers the association between non-consecutive and non-adjacent trajectory points, and is also based on the Transformer model. Hence, there is some overlap between STAN and the proposed model STTF. To ensure the novelty of the proposed model, the authors should explain more clearly the advantages of the proposed model over the STAN model in the experiment and discussion sections.
Author Response
This version is a significant improvement over the previous version, but there are still some points that could be optimized:
- There are two figures numbered Figure 5; the second Figure 5 is not mentioned in the text.
Response 1:
Thanks for your suggestion. It was our mistake, and we have now corrected it. Please see revisions between Pg. 12, Line 468 – Line 470 in the clean version.
- The STAN model cited in this manuscript also considers the association between non-consecutive and non-adjacent trajectory points, and is also based on the Transformer model. Hence, there is some overlap between STAN and the proposed model STTF. To ensure the novelty of the proposed model, the authors should explain more clearly the advantages of the proposed model over the STAN model in the experiment and discussion sections.
Response 2:
Thanks for your advice. We have added some explanations to clarify the advantages of the proposed model over the STAN model. The specific changes are as follows:
Compared with STAN, the results showed that our STTF-Recommender improved recommendation performance by about 5%. This is because our model employs a multi-head attention layer and a position-wise feed-forward network. The multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. Furthermore, the position-wise feed-forward network also improves the performance of the model, as demonstrated in Section 5.3.
For the detailed version, please see revisions between Pg. 10, Line 410 – Line 417 in the clean version. A minimal sketch of such an attention-plus-feed-forward block is given below.
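To make the architectural argument tangible, here is a minimal sketch of one block combining multi-head attention with a position-wise feed-forward network. The dimensions, head count, and residual/normalization layout follow the standard Transformer encoder and are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer block: multi-head attention + position-wise FFN."""
    def __init__(self, dim=64, heads=4, ffn_dim=256, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # Every head attends over all positions, so non-consecutive
        # check-ins are related directly rather than through recurrence.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

out = EncoderBlock()(torch.randn(2, 10, 64))  # (batch, sequence, dim)
print(out.shape)  # torch.Size([2, 10, 64])
```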
Reviewer 4 Report
The authors submitted a revised version of their manuscript. My comment regarding the last paragraph of the related work was not addressed; they simply added one reference. They minimally addressed the symbol issues, but only those I explicitly mentioned. They barely touched the figure and did not really improve the explanation. They did not explain STAN in Section 4.1 and did not improve their explanations of the symbols in 4.2. I compared the two PDFs and could see that only very minimal changes were made. So, I still maintain my decision.
Author Response
The authors submitted a revised version of their manuscript. (1) My comment regarding the last paragraph of the related work was not addressed; they simply added one reference. (2) They minimally addressed the symbol issues, but only those I explicitly mentioned. (3) They barely touched the figure and did not really improve the explanation. (4) They did not explain STAN in Section 4.1 and (5) did not improve their explanations of the symbols in 4.2. I compared the two PDFs and could see that only very minimal changes were made. So, I still maintain my decision.
Response 1:
We sincerely thank you for reading our manuscript so carefully and pointing out our mistakes. Our responses to your suggestions are as follows:
(1) Thanks for your suggestion. We have added some relevant references in Section 2. Please see revisions between Pg. 5, Line 200 – Line 210 in the clean version.
(2) Thank you for your valuable comments. We have added a table describing the main notations. Please see revisions between Pg. 5, Line 217 – Line 233 in the clean version.
(3) Thanks for your advice. We have optimized Figure 2 and added the explanation “The model is mainly divided into three layers with seven steps, from ① to ⑦”. Please see revisions between Pg. 6, Line 240 – Line 254 in the clean version.
(4) Thanks for your suggestion. We have added some explanations of STAN; please see revisions between Pg. 6, Line 257 – Pg. 7, Line 267 in the clean version. We also compared our model with STAN; please see revisions between Pg. 10, Line 410 – Line 417 in the clean version.
(5) Thank you for your valuable comments. We have added a table describing the main notations and corrected some symbols. Please see revisions between Pg. 5, Line 217 – Line 233 and Pg. 7, Line 279 – Line 288 in the clean version.