{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T21:25:21Z","timestamp":1773091521195,"version":"3.50.1"},"reference-count":71,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2025,2,11]],"date-time":"2025-02-11T00:00:00Z","timestamp":1739232000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Finding a template location in a query image is a fundamental problem in many computer vision applications, such as localization of known objects, image registration, image matching, and object tracking. Currently available methods fail when insufficient training data are available or big variations in the textures, different modalities, and weak visual features exist in the images, leading to limited applications on real-world tasks. We introduce Self-Supervised Foundation Model for Template Matching (Self-TM), a novel end-to-end approach to self-supervised learning template matching. The idea behind Self-TM is to learn hierarchical features incorporating localization properties from images without any annotations. As going deeper in the convolutional neural network (CNN) layers, their filters begin to react to more complex structures and their receptive fields increase. This leads to loss of localization information in contrast to the early layers. The hierarchical propagation of the last layers back to the first layer results in precise template localization. 
Due to its zero-shot generalization capabilities on tasks such as image retrieval, dense template matching, and sparse image matching, our pre-trained model can be classified as a foundation one.<\/jats:p>","DOI":"10.3390\/bdcc9020038","type":"journal-article","created":{"date-parts":[[2025,2,11]],"date-time":"2025-02-11T11:01:08Z","timestamp":1739271668000},"page":"38","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Self-Supervised Foundation Model for Template Matching"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5806-1848","authenticated-orcid":false,"given":"Anton","family":"Hristov","sequence":"first","affiliation":[{"name":"Faculty of Mathematics and Informatics, Sofia University \u201cSt. Kliment Ohridski\u201d, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8203-6706","authenticated-orcid":false,"given":"Dimo","family":"Dimov","sequence":"additional","affiliation":[{"name":"Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Acad. G. Bonchev Str., Block 2, 1113 Sofia, Bulgaria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9917-9535","authenticated-orcid":false,"given":"Maria","family":"Nisheva-Pavlova","sequence":"additional","affiliation":[{"name":"Faculty of Mathematics and Informatics, Sofia University \u201cSt. Kliment Ohridski\u201d, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria"},{"name":"Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Acad. G. Bonchev Str., Block 8, 1113 Sofia, Bulgaria"}]}],"member":"1968","published-online":{"date-parts":[[2025,2,11]]},"reference":[{"key":"ref_1","first-page":"1597","article-title":"A simple framework for contrastive learning of visual representations","volume":"1","author":"Chen","year":"2020","journal-title":"Int. Conf. Mach. 
Learn."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2019, January 15\u201320). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA.","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Dollar, P., and Girshick, R. (2022, January 18\u201324). Masked autoencoders are scalable vision learners. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Jang, J., Kim, S., Yoo, K., Kong, C., Kim, J., and Kwak, N. (2023, January 2\u20137). Self-Distilled Self-Supervised Representation Learning. Proceedings of the 2023 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA.","DOI":"10.1109\/WACV56688.2023.00285"},{"key":"ref_5","unstructured":"Kalapos, A., and Gyires-T\u00f3th, B. (2024). CNN-JEPA: Self-Supervised Pretraining Convolutional Neural Networks Using Joint Embedding Predictive Architecture. arXiv."},{"key":"ref_6","first-page":"1","article-title":"RingMo-Lite: A remote sensing lightweight network with CNN-Transformer hybrid framework","volume":"62","author":"Wang","year":"2024","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., and Torr, P.H.S. (2016). Fully-Convolutional Siamese networks for object tracking. Computer Vision\u2014ECCV 2016 Workshops. ECCV 2016, Springer. Lecture notes in computer science.","DOI":"10.1007\/978-3-319-48881-3_56"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Li, B., Yan, J., Wu, W., Zhu, Z., and Hu, X. (2018, January 18\u201323). 
High performance visual tracking with Siamese Region Proposal network. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00935"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"He, A., Luo, C., Tian, X., and Zeng, W. (2018, January 18\u201323). A twofold siamese network for real-time object tracking. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00508"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., and Torr, P.H.S. (2017, January 21\u201326). End-to-end representation learning for correlation filter based tracking. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.531"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., and Yan, J. (2018). SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. arXiv.","DOI":"10.1109\/CVPR.2019.00441"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., and Hu, W. (2018). Distractor-Aware Siamese networks for visual object tracking. Computer Vision\u2014ECCV 2018. ECCV 2018, Springer. Lecture notes in computer science.","DOI":"10.1007\/978-3-030-01240-3_7"},{"key":"ref_13","unstructured":"Fan, H., and Ling, H. (2018, January 18\u201323). Siamese cascaded region proposal networks for real-time visual tracking. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Song, Y., Ma, C., Gong, L., Zhang, J., Lau, R., and Yang, M.-H. (2017, January 22\u201329). CREST: Convolutional Residual Learning for Visual Tracking. 
Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.279"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Guo, D., Wang, J., Cui, Y., Wang, Z., and Chen, S. (2020, January 13\u201319). SIAMCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00630"},{"key":"ref_16","first-page":"3072","article-title":"SiamMask: A framework for fast online object tracking and segmentation","volume":"45","author":"Hu","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. (2022, January 18\u201324). A ConvNet for the 2020s. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01167"},{"key":"ref_18","unstructured":"Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., and Beyer, L. (2021). How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_20","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4\u20139). Attention Is All You Need. 
Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_21","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_22","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J\u00e9gou, H. (2020, January 13\u201318). Training data-efficient image transformers & distillation through attention. Proceedings of the International Conference on Machine Learning 2020, Virtual Event."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hisham, M.B., Yaakob, S.N., Raof, R.A.A., Nazren, A.B.A., and Wafi, N.M. (2015, January 13\u201314). Template matching using sum of squared difference and normalized cross correlation. Proceedings of the IEEE Student Conference on Research and Development (SCOReD) 2015, Kuala Lumpur, Malaysia.","DOI":"10.1109\/SCORED.2015.7449303"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Niitsuma, H., and Maruyama, T. (September, January 31). Sum of absolute difference implementations for image processing on FPGAs. Proceedings of the 2010 International Conference on Field Programmable Logic and Applications, Milan, Italy.","DOI":"10.1109\/FPL.2010.40"},{"key":"ref_25","unstructured":"Papageorgiou, C.P., Oren, M., and Poggio, T. (1998, January 7). A general framework for object detection. Proceedings of the Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), Bombay, India."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"2129","DOI":"10.1016\/j.patrec.2005.03.022","article-title":"ZNCC-based template matching using bounded partial correlation","volume":"26","author":"Mattoccia","year":"2005","journal-title":"Pattern Recognit. 
Lett."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1023\/B:VISI.0000029664.99615.94","article-title":"Distinctive Image Features from Scale-Invariant Keypoints","volume":"60","author":"Lowe","year":"2004","journal-title":"Int. J. Comput. Vis."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Bay, H., Tuytelaars, T., and Van Gool, L. (2006). SURF: Speeded up robust features. Computer Vision\u2014ECCV 2006. ECCV 2006, Springer.","DOI":"10.1007\/11744023_32"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. (2011, January 6\u201313). ORB: An efficient alternative to SIFT or SURF. Proceedings of the International Conference on Computer Vision 2011, Barcelona, Spain.","DOI":"10.1109\/ICCV.2011.6126544"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"DeTone, D., Malisiewicz, T., and Rabinovich, A. (2018, January 18\u201322). SuperPoint: Self-supervised interest point detection and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2018, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPRW.2018.00060"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Sarlin, P.-E., DeTone, D., Malisiewicz, T., and Rabinovich, A. (2020, January 13\u201319). SuperGlue: Learning feature matching with graph neural networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00499"},{"key":"ref_32","first-page":"2292","article-title":"Sinkhorn Distances: Lightspeed computation of optimal transport","volume":"26","author":"Cuturi","year":"2013","journal-title":"Neural Inf. Process. Syst."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Lindenberger, P., Sarlin, P.-E., and Pollefeys, M. (2023, January 1\u20136). LightGlue: Local feature matching at light speed. 
Proceedings of the International Conference on Computer Vision 2023, Paris, France.","DOI":"10.1109\/ICCV51070.2023.01616"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Jiang, H., Karpur, A., Cao, B., Huang, Q., and Araujo, A. (2024, January 16\u201322). OmniGlue: Generalizable feature matching with foundation model guidance. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.01878"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021, January 10\u201317). Emerging Properties in Self-Supervised Vision Transformers. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Chen, H., Luo, Z., Zhou, L., Tian, Y., Zhen, M., Fang, T., McKinnon, D., Tsin, Y., and Quan, L. (2022). ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer. European Conference on Computer Vision, Springer. Lecture notes in computer science.","DOI":"10.1007\/978-3-031-19824-3_2"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Sun, J., Shen, Z., Wang, Y., Bao, H., and Zhou, X. (2021, January 20\u201325). LOFTR: Detector-Free local feature matching with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00881"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Edstedt, J., Athanasiadis, I., Wadenb\u00e4ck, M., and Felsberg, M. (2023, January 17\u201324). DKM: Dense Kernelized Feature Matching for geometry estimation. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01704"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"10247","DOI":"10.1109\/TPAMI.2023.3249225","article-title":"PDC-NET+: Enhanced Probabilistic Dense Correspondence Network","volume":"45","author":"Truong","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_40","unstructured":"Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., and El-Nouby, A. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv."},{"key":"ref_41","unstructured":"Van Den Oord, A., Li, Y., and Vinyals, O. (2018). Representation Learning with Contrastive Predictive Coding. arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., and Lo, W.-Y. (2023, January 1\u20136). Segment anything. Proceedings of the IEEE\/CVF International Conference on Computer Vision 2023, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"ref_43","unstructured":"Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., R\u00e4dle, R., Rolland, C., and Gustafson, L. (2024). SAM 2: Segment anything in images and videos. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Wang, X., Zhang, X., Cao, Y., Wang, W., Shen, C., and Huang, T. (2023, January 1\u20136). SegGPT: Towards segmenting everything in context. Proceedings of the IEEE\/CVF International Conference on Computer Vision 2023, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00110"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Wang, X., Wang, W., Cao, Y., Shen, C., and Huang, T. (2023, January 17\u201324). Images speak in images: A generalist painter for in-context visual learning. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00660"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Balntas, V., Lenc, K., Vedaldi, A., and Mikolajczyk, K. (2017, January 21\u201326). HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.410"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Li, Z., and Snavely, N. (2018, January 18\u201322). MegaDepth: Learning single-view depth prediction from internet photos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00218"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., and Niessner, M. (2017, January 21\u201326). ScanNet: Richly-annotated 3D reconstructions of indoor scenes. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.261"},{"key":"ref_49","unstructured":"Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). Layer normalization. arXiv."},{"key":"ref_50","unstructured":"Nair, V., and Hinton, G.E. (2010, January 21\u201324). Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 10\u201317). Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. 
Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_52","unstructured":"Tan, M., and Le, Q.V. (2019, January 9\u201315). EfficientNet: Rethinking model scaling for convolutional neural networks. Proceedings of the International Conference on Machine Learning 2019, Long Beach, CA, USA."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., and Dollar, P. (2020, January 13\u201319). Designing network design spaces. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01044"},{"key":"ref_54","unstructured":"Tan, M., and Le, Q. (2021, January 18\u201324). EfficientNetV2: Smaller models and faster training. Proceedings of the International Conference on Machine Learning 2021, Virtual."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. (2020). Big Transfer (BIT): General visual representation learning. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-030-58558-7_29"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). ImageNet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., and Ballas, N. (2023, January 17\u201324). Self-Supervised learning from images with a joint-embedding predictive architecture. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01499"},{"key":"ref_58","first-page":"8799","article-title":"VICRegL: Self-supervised learning of local visual features","volume":"35","author":"Bardes","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_59","first-page":"211","article-title":"Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters","volume":"2","author":"Bridle","year":"1989","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_60","unstructured":"Loshchilov, I., and Hutter, F. (2017). Decoupled weight decay regularization. arXiv."},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"136062","DOI":"10.1109\/ACCESS.2019.2940737","article-title":"Twin-Net descriptor: Twin negative mining with quad loss for Patch-Based matching","volume":"7","author":"Irshad","year":"2019","journal-title":"IEEE Access"},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Schonberger, J.L., and Frahm, J.-M. (2016, January 27\u201330). Structure-from-motion revisited. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.445"},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Fischler, M.A., and Bolles, R.C. (1987). Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Readings in Computer Vision, Elsevier. Elsevier eBooks.","DOI":"10.1016\/B978-0-08-051581-6.50070-2"},{"key":"ref_64","unstructured":"Ioffe, S. (2017, January 4\u20139). Batch renormalization: Towards reducing minibatch dependence in Batch-Normalized models. Proceedings of the Advances in Neural Information Processing Systems 2017, Long Beach, CA, USA."},{"key":"ref_65","unstructured":"Aharon, N., Orfaig, R., and Bobrovsky, B.-Z. (2022). 
BOT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv."},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., and Wang, X. (2022). ByteTrack: Multi-object tracking by associating every detection box. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-031-20047-2_1"},{"key":"ref_67","unstructured":"Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., and Langlotz, C. (2020). Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv."},{"key":"ref_68","doi-asserted-by":"crossref","unstructured":"Huang, S.-C., Shen, L., Lungren, M.P., and Yeung, S. (2021, January 10\u201317). GLORIA: A multimodal Global-Local Representation learning framework for label-efficient medical image recognition. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00391"},{"key":"ref_69","unstructured":"Wang, F., Zhou, Y., Wang, S., Vardhanabhuti, V., and Yu, L. (2022). Multi-Granularity cross-modal alignment for generalized medical visual representation learning. arXiv."},{"key":"ref_70","unstructured":"Liu, C., Ouyang, C., Cheng, S., Shah, A., Bai, W., and Arcucci, R. (2023). G2D: From global to Dense Radiography Representation Learning via Vision-Language Pre-training. arXiv."},{"key":"ref_71","doi-asserted-by":"crossref","first-page":"519","DOI":"10.1109\/TMI.2024.3449690","article-title":"IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-Training","volume":"44","author":"Liu","year":"2025","journal-title":"IEEE Trans. Med. 
Imaging"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/2\/38\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:31:12Z","timestamp":1760027472000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/2\/38"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,11]]},"references-count":71,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2025,2]]}},"alternative-id":["bdcc9020038"],"URL":"https:\/\/doi.org\/10.3390\/bdcc9020038","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,11]]}}}