Tracklet Pair Proposal and Context Reasoning for Video Scene Graph Generation
Abstract
1. Introduction
- Important design issues for VidSGG models are identified, and a novel deep neural network model, VSGG-Net, is proposed to address them effectively.
- A new tracklet pair proposal method is presented that evaluates the relatedness of object tracklet pairs using a pretrained neural network and statistical information.
- The proposed model performs low-level visual context reasoning and high-level semantic context reasoning over a spatio-temporal context graph with a graph neural network, yielding rich spatio-temporal context.
- A class weighting technique that increases the weight of sparse relationships in the classification loss function is applied to improve detection performance on sparse relationships.
- The effectiveness and high performance of the proposed model are demonstrated through experiments on the benchmark datasets VidOR and VidVRD.
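The class weighting idea in the design points above can be made concrete. The paper's exact weighting formula is not reproduced here; a common inverse-frequency scheme (the function name and the `beta` exponent are illustrative assumptions) might look like this minimal sketch:

```python
import numpy as np

def class_weights(labels, num_classes, beta=1.0):
    """Inverse-frequency class weights: sparse relationship classes
    receive larger weights in the classification loss."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts[counts == 0] = 1.0           # avoid division by zero for unseen classes
    weights = (counts.sum() / counts) ** beta
    return weights / weights.mean()     # normalize so the average weight is 1.0

# Toy example: class 2 appears only once, so it gets the largest weight.
labels = [0, 0, 0, 0, 1, 1, 2]
w = class_weights(labels, num_classes=3)
```

Such a weight vector would typically be passed to a weighted cross-entropy loss so that errors on rare relationship classes are penalized more heavily.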
2. Related Work
2.1. Visual Scene Graph Generation
2.2. Video Visual Relation Detection
3. VSGG-Net: The Video Scene Graph Generation Model
3.1. Model Overview
3.2. Object Tracklet Detection and Pair Proposal
3.3. Context Reasoning and Classification
4. Experiments
4.1. Dataset and Model Training
4.2. Experiments
4.3. Qualitative Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Notation | Description
---|---
 | Complete graph
 | Sparse graph
 | Context graph
 | Temporal intersection over union
 | Object
 | Object class distribution
 | Co-occurrence matrix
 | Attention on the edge connecting two object nodes
 | Center coordinate of the bounding box of an object
 | Size of the bounding box of an object
 | Information received from neighboring nodes
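The temporal intersection over union listed above can be stated in a few lines. Assuming each tracklet is summarized by its (start, end) frame interval (a simplification; the paper's tracklets also carry per-frame bounding boxes), a minimal sketch:

```python
def temporal_iou(a, b):
    """Temporal IoU of two tracklets, each given as a (start, end) frame interval."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))   # overlapping frames
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter       # combined temporal extent
    return inter / union if union > 0 else 0.0

overlap = temporal_iou((0, 10), (5, 15))   # intervals share 5 of 15 frames
```

A tIoU threshold of this kind is a natural way to filter out tracklet pairs that never (or barely) co-occur in time before proposing them as relation candidates.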
Performance Metrics | Description
---|---
R@1, R@50, R@100 | R@K is the recall among the top-K results.
P@1, P@5, P@10 | P@K is the precision among the top-K results.
mAP | Mean average precision: the mean of the average precision scores over all queries.
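R@K and P@K as defined above can be sketched directly. The ranked triplet strings below are purely illustrative examples, not data from the paper:

```python
def recall_at_k(predicted, ground_truth, k):
    """R@K: fraction of ground-truth relations found among the top-K predictions."""
    return sum(g in predicted[:k] for g in ground_truth) / len(ground_truth)

def precision_at_k(predicted, ground_truth, k):
    """P@K: fraction of the top-K predictions that match a ground-truth relation."""
    top = predicted[:k]
    return sum(p in ground_truth for p in top) / len(top)

ranked = ["dog-chase-cat", "person-ride-bicycle", "cat-sit_above-sofa"]
truth = {"dog-chase-cat", "cat-sit_above-sofa"}
```

In relation detection a prediction usually only counts as a match when its tracklets also overlap the ground truth sufficiently; that spatial test is omitted here for brevity.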
Method | # of pairs (↓) | R@1 (↑) | P@5 (↑) |
---|---|---|---|
None | 958,192 | - | 0.467 |
TF | 21,560 | - | 20.78 |
TF+NS | 10,235 | 70.79 | 30.98 |
TF+SS | 13,992 | 97.44 | 31.19 |
TF+NS+SS (Ours) | 13,965 | 97.37 | 31.23 |
Relation detection is measured by R@50, R@100, and mAP; relation tagging by P@1 and P@5.

Method | R@50 | R@100 | mAP | P@1 | P@5
---|---|---|---|---|---
GCN | 6.11 | 7.32 | 6.24 | 50.70 | 47.81
S-GCN | 7.66 | 10.51 | 8.90 | 55.39 | 52.54
T-GCN | 7.74 | 10.85 | 8.97 | 56.51 | 54.35
ST-GNN (Ours) | 8.15 | 11.53 | 9.80 | 58.44 | 54.16
Relation detection is measured by R@50, R@100, and mAP; relation tagging by P@1 and P@5.

Level | R@50 | R@100 | mAP | P@1 | P@5
---|---|---|---|---|---
Visual reasoning | 7.98 | 11.48 | 9.76 | 58.21 | 54.09
Semantic reasoning | 5.63 | 6.42 | 5.69 | 50.90 | 45.02
Visual + semantic reasoning (Ours) | 8.15 | 11.53 | 9.80 | 58.44 | 54.16
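The GCN baseline in the ablations follows Kipf and Welling's graph convolution. One propagation step (a minimal NumPy sketch of the standard GCN update, not the paper's full ST-GNN) aggregates degree-normalized features from each node's neighbors and itself:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                        # adjacency with self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))  # symmetric normalization
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)  # ReLU

# Two connected nodes with one-hot features and an identity weight matrix.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
out = gcn_layer(A, np.eye(2), np.eye(2))
```

Stacking such layers over a spatio-temporal context graph is what lets each tracklet node accumulate information from its spatial and temporal neighbors, in the spirit of the S-GCN, T-GCN, and ST-GNN variants compared above.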
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Jung, G.; Lee, J.; Kim, I. Tracklet Pair Proposal and Context Reasoning for Video Scene Graph Generation. Sensors 2021, 21, 3164. https://doi.org/10.3390/s21093164