{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,4]],"date-time":"2025-11-04T23:42:41Z","timestamp":1762299761363,"version":"build-2065373602"},"reference-count":21,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2021,5,2]],"date-time":"2021-05-02T00:00:00Z","timestamp":1619913600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"MSIT (Ministry of Science and ICT)","award":["IITP-2017-0-01642"],"award-info":[{"award-number":["IITP-2017-0-01642"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Video scene graph generation (ViDSGG), the creation of video scene graphs that helps in deeper and better visual scene understanding, is a challenging task. Segment-based and sliding-window based methods have been proposed to perform this task. However, they all have certain limitations. This study proposes a novel deep neural network model called VSGG-Net for video scene graph generation. The model uses a sliding window scheme to detect object tracklets of various lengths throughout the entire video. In particular, the proposed model presents a new tracklet pair proposal method that evaluates the relatedness of object tracklet pairs using a pretrained neural network and statistical information. To effectively utilize the spatio-temporal context, low-level visual context reasoning is performed using a spatio-temporal context graph and a graph neural network as well as high-level semantic context reasoning. To improve the detection performance for sparse relationships, the proposed model applies a class weighting technique that adjusts the weight of sparse relationships to a higher level. 
This study demonstrates the positive effect and high performance of the proposed model through experiments on the benchmark datasets VidOR and VidVRD.<\/jats:p>","DOI":"10.3390\/s21093164","type":"journal-article","created":{"date-parts":[[2021,5,2]],"date-time":"2021-05-02T08:05:21Z","timestamp":1619942721000},"page":"3164","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["Tracklet Pair Proposal and Context Reasoning for Video Scene Graph Generation"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7314-8108","authenticated-orcid":false,"given":"Gayoung","family":"Jung","sequence":"first","affiliation":[{"name":"Department of Computer Science, Kyonggi University, Suwon-si 16227, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7436-5561","authenticated-orcid":false,"given":"Jonghun","family":"Lee","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Kyonggi University, Suwon-si 16227, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5754-133X","authenticated-orcid":false,"given":"Incheol","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Kyonggi University, Suwon-si 16227, Korea"}]}],"member":"1968","published-online":{"date-parts":[[2021,5,2]]},"reference":[{"key":"ref_1","unstructured":"Xu, P., Chang, X., Guo, L., Huang, P.Y., and Chen, X. (2020). A survey of scene graph: Generation and application. IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Xie, W., Ren, G., and Liu, S. (2020, October 12\u201316). Video relation detection with trajectory-aware multi-modal features. Proceedings of the ACM International Conference on Multimedia, Seattle, WA, USA.","DOI":"10.1145\/3394171.3416284"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Shang, X., Ren, T., Guo, J., Zhang, H., and Chua, T.S. (2017, October 23\u201327). 
Video visual relation detection. Proceedings of the ACM International Conference on Multimedia, Mountain View, CA, USA.","DOI":"10.1145\/3123266.3123380"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Sun, X., Ren, T., Zi, Y., and Wu, G. (2019, October 21\u201325). Video visual relation detection via multi-modal feature fusion. Proceedings of the ACM International Conference on Multimedia, Nice, France.","DOI":"10.1145\/3343031.3356076"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Qian, X., Zhuang, Y., Li, Y., Xiao, S., Pu, S., and Xiao, J. (2019, October 21\u201325). Video relation detection with spatio-temporal graph. Proceedings of the ACM International Conference on Multimedia, Nice, France.","DOI":"10.1145\/3343031.3351058"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Tsai, Y.H.H., Divvala, S., Morency, L.P., Salakhutdinov, R., and Farhadi, A. (2019, June 16\u201320). Video relationship reasoning using gated spatio-temporal energy graph. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01067"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Su, Z., Shang, X., Chen, J., Jiang, Y.G., Qiu, Z., and Chua, T.S. (2020, October 12\u201316). Video relation detection via multiple hypothesis association. Proceedings of the ACM International Conference on Multimedia, Seattle, WA, USA.","DOI":"10.1145\/3394171.3413764"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Zheng, S., Chen, X., Chen, S., and Jin, Q. (2019, October 21\u201325). Relation understanding in videos. Proceedings of the ACM International Conference on Multimedia, Nice, France.","DOI":"10.1145\/3343031.3356080"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Liu, C., Jin, Y., Xu, K., Gong, G., and Mu, Y. (2020, June 14\u201319). Beyond short-term snippet: Video relation detection with spatio-temporal global context. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online.","DOI":"10.1109\/CVPR42600.2020.01085"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Shang, X., Di, D., Xiao, J., Cao, Y., Yang, X., and Chua, T.S. (2019, June 10\u201313). Annotating objects and relations in user-generated videos. Proceedings of the International Conference on Multimedia Retrieval, Ottawa, ON, Canada.","DOI":"10.1145\/3323873.3325056"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Dai, B., Zhang, Y., and Lin, D. (2017, July 21\u201326). Detecting visual relationships with deep relational networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.352"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Xu, D., Zhu, Y., Choy, C.B., and Li, F.F. (2017, July 21\u201326). Scene graph generation by iterative message passing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.330"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Li, Y., Ouyang, W., Zhou, B., Wang, K., and Wang, X. (2017, October 22\u201329). Scene graph generation from objects, phrases and region captions. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.142"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Yang, J., Lu, J., Lee, S., Batra, D., and Parikh, D. (2018, September 8\u201314). Graph R-CNN for scene graph generation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01246-5_41"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zellers, R., Yatskar, M., Thomson, S., and Choi, Y. (2018, June 18\u201322). Neural motifs: Scene graph parsing with global context. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00611"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, C.L. (2014, September 6\u201312). Microsoft COCO: Common objects in context. Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 1137\u20131149.","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"ref_18","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, June 26\u2013July 1). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., and Bernstein, M.S. (2015). ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., 211\u2013252.","DOI":"10.1007\/s11263-015-0816-y"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Wojke, N., Bewley, A., and Paulus, D. (2017, September 17\u201320). Simple online and realtime tracking with a deep association metric. Proceedings of the IEEE International Conference on Image Processing (ICIP), Beijing, China.","DOI":"10.1109\/ICIP.2017.8296962"},{"key":"ref_21","unstructured":"Kipf, T.N., and Welling, M. (2017, April 24\u201326). Semi-supervised classification with graph convolutional networks. 
Proceedings of the 5th International Conference on Learning Representations, Toulon, France."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/9\/3164\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:56:45Z","timestamp":1760162205000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/9\/3164"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,2]]},"references-count":21,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2021,5]]}},"alternative-id":["s21093164"],"URL":"https:\/\/doi.org\/10.3390\/s21093164","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2021,5,2]]}}}