{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T04:11:00Z","timestamp":1780459860892,"version":"3.54.1"},"reference-count":41,"publisher":"National Library of Serbia","issue":"3","license":[{"start":{"date-parts":[[2024,1,1]],"date-time":"2024-01-01T00:00:00Z","timestamp":1704067200000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["ComSIS","COMPUT SCI INF SYST","COMPUT SCI INFORM SY","COMPUTER SCI INFORM","COMSIS J"],"published-print":{"date-parts":[[2024]]},"abstract":"<jats:p>Recent work has demonstrated that the Transformer model is effective for computer vision tasks. However, the global self-attention mechanism used in Transformer models does not adequately consider the local structure and details of images, which may result in the loss of information and local details, causing decreased estimation accuracy in gaze estimation tasks when compared to convolution or sequential stacking methods. To address this issue, we propose a parallel CNNs-Transformer aggregation network (CTA-Net) for gaze estimation, which fully leverages the advantages of the Transformer model in modeling global context and of convolutional neural networks (CNNs) in retaining local details. Specifically, Transformer and ResNet are deployed to extract facial and eye information, respectively. Additionally, an attention cross fusion (ACFusion) Block is embedded in the CNN branch, which decomposes features in space and channels to supplement lost features, suppress noise, and help extract eye features more effectively. Finally, a dual-feature aggregation (DFA) module is proposed to effectively fuse the output features of both branches with the help of a feature selection mechanism and a residual structure. 
Experimental results on the MPIIGaze and Gaze360 datasets demonstrate that our CTA-Net achieves state-of-the-art results.<\/jats:p>","DOI":"10.2298\/csis231116020x","type":"journal-article","created":{"date-parts":[[2024,5,14]],"date-time":"2024-05-14T08:23:20Z","timestamp":1715675000000},"page":"831-850","source":"Crossref","is-referenced-by-count":4,"title":["CTA-Net: A gaze estimation network based on dual feature aggregation and attention cross fusion"],"prefix":"10.2298","volume":"21","author":[{"given":"Chenxing","family":"Xia","sequence":"first","affiliation":[{"name":"College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan China + Institute of Energy, Hefei Comprehensive National Science Center, Hefei, China + Anhui Purvar Bigdata Technology Co. Ltd Huainan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhanpeng","family":"Tao","sequence":"additional","affiliation":[{"name":"College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wei","family":"Wang","sequence":"additional","affiliation":[{"name":"Anyang Cigarette Factory, China Tobacco Henan Industrial Co, Anyang, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wenjun","family":"Zhao","sequence":"additional","affiliation":[{"name":"College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bin","family":"Ge","sequence":"additional","affiliation":[{"name":"College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiuju","family":"Gao","sequence":"additional","affiliation":[{"name":"Anyang Cigarette Factory, China Tobacco Henan Industrial Co, Anyang, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kuan-Ching","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Information Engineering, Providence University, Taichung City, Taiwan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yan","family":"Zhang","sequence":"additional","affiliation":[{"name":"The School of Electronics and Information Engineering, Anhui University, Hefei, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1078","reference":[{"key":"ref1","unstructured":"Biswas, P., et al.: Appearance-based gaze estimation using attention and difference mechanism. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3143-3152 (2021)"},{"key":"ref2","unstructured":"Cai, X., Chen, B., Zeng, J., Zhang, J., Sun, Y.,Wang, X., Ji, Z., Liu, X., Chen, X., Shan, S.: Gaze estimation with an ensemble of four architectures. arXiv preprint arXiv:2107.01980 (2021)"},{"key":"ref3","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Proceedings of the European Conference on Computer Vision. pp. 213-229. Springer (2020)","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref4","doi-asserted-by":"crossref","unstructured":"Chen, Z., Shi, B.E.: Appearance-based gaze estimation using dilated-convolutions. In: Proceedings of the Asian Conference on Computer Vision. pp. 309-324. Springer (2018)","DOI":"10.1007\/978-3-030-20876-9_20"},{"key":"ref5","doi-asserted-by":"crossref","unstructured":"Cheng, Y., Huang, S., Wang, F., Qian, C., Lu, F.: A coarse-to-fine adaptive network for appearance-based gaze estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 
10623-10630 (2020)","DOI":"10.1609\/aaai.v34i07.6636"},{"key":"ref6","doi-asserted-by":"crossref","unstructured":"Cheng, Y., Lu, F.: Gaze estimation using transformer. arXiv preprint arXiv:2105.14424 (2021)","DOI":"10.1109\/ICPR56361.2022.9956687"},{"key":"ref7","unstructured":"Cheng, Y., Wang, H., Bao, Y., Lu, F.: Appearance-based gaze estimation with deep learning: A review and benchmark. arXiv preprint arXiv:2104.12668 (2021)"},{"key":"ref8","doi-asserted-by":"crossref","unstructured":"Cheng, Y., Zhang, X., Lu, F., Sato, Y.: Gaze estimation by exploring two-eye asymmetry. IEEE Transactions on Image Processing 29, 5259-5272 (2020)","DOI":"10.1109\/TIP.2020.2982828"},{"key":"ref9","doi-asserted-by":"crossref","unstructured":"Cristina, S., Camilleri, K.P.: Model-based head pose-free gaze estimation for assistive communication. Computer Vision and Image Understanding 149, 157-170 (2016)","DOI":"10.1016\/j.cviu.2016.02.012"},{"key":"ref10","doi-asserted-by":"crossref","unstructured":"Dari, S., Kadrileev, N., H\u00fcllermeier, E.: A neural network-based driver gaze classification system with vehicle signals. In: Proceedings of the International Joint Conference on Neural Networks. pp. 1-7. IEEE (2020)","DOI":"10.1109\/IJCNN48605.2020.9207709"},{"key":"ref11","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)"},{"key":"ref12","doi-asserted-by":"crossref","unstructured":"Fischer, T., Chang, H.J., Demiris, Y.: Rt-gene: Real-time eye gaze estimation in natural environments. In: Proceedings of the European Conference on Computer Vision. pp. 334-352 (2018)","DOI":"10.1007\/978-3-030-01249-6_21"},{"key":"ref13","doi-asserted-by":"crossref","unstructured":"Hansen, D.W., Ji, Q.: In the eye of the beholder: A survey of models for eyes and gaze. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 32(3), 478-500 (2009)","DOI":"10.1109\/TPAMI.2009.30"},{"key":"ref14","unstructured":"Huang, C.H., Wu, H.Y., Lin, Y.L.: Hardnet-mseg: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps. arXiv preprint arXiv:2101.07172 (2021)"},{"key":"ref15","doi-asserted-by":"crossref","unstructured":"Huang, H., Ren, L., Yang, Z., Zhan, Y., Zhang, Q., Lv, J.: Gazeattentionnet: Gaze estimation with attentions. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 2435-2439. IEEE (2022)","DOI":"10.1109\/ICASSP43922.2022.9747911"},{"key":"ref16","doi-asserted-by":"crossref","unstructured":"Huang, Q., Veeraraghavan, A., Sabharwal, A.: Tabletgaze: dataset and analysis for unconstrained appearance-based gaze estimation in mobile tablets. Machine Vision and Applications 28, 445-461 (2017)","DOI":"10.1007\/s00138-017-0852-4"},{"key":"ref17","doi-asserted-by":"crossref","unstructured":"Kellnhofer, P., Recasens, A., Stent, S., Matusik, W., Torralba, A.: Gaze360: Physically unconstrained gaze estimation in the wild. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6912-6921 (2019)","DOI":"10.1109\/ICCV.2019.00701"},{"key":"ref18","doi-asserted-by":"crossref","unstructured":"Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., Torralba, A.: Eye tracking for everyone. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2176-2184 (2016)","DOI":"10.1109\/CVPR.2016.239"},{"key":"ref19","unstructured":"Liu, G., Yu, Y., Mora, K.A.F., Odobez, J.M.: A differential approach for gaze estimation with calibration. In: Proceedings of the British Machine Vision Conference. vol. 2, p. 6 (2018)"},{"key":"ref20","doi-asserted-by":"crossref","unstructured":"Lu, F., Chen, X., Sato, Y.: Appearance-based gaze estimation via uncalibrated gaze pattern recovery. 
IEEE Transactions on Image Processing 26(4), 1543-1553 (2017)","DOI":"10.1109\/TIP.2017.2657880"},{"key":"ref21","doi-asserted-by":"crossref","unstructured":"Lu, F., Sugano, Y., Okabe, T., Sato, Y.: Adaptive linear regression for appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(10), 2033- 2046 (2014)","DOI":"10.1109\/TPAMI.2014.2313123"},{"key":"ref22","unstructured":"Mora, K.A.F., Odobez, J.M.: Person independent 3d gaze estimation from remote rgb-d cameras. In: Proceedings of the IEEE International Conference on Image Processing. pp. 2787- 2791. IEEE (2013)"},{"key":"ref23","doi-asserted-by":"crossref","unstructured":"Morimoto, C.H., Mimica, M.R.: Eye gaze tracking techniques for interactive applications. Computer Vision and Image Understanding 98(1), 4-24 (2005)","DOI":"10.1016\/j.cviu.2004.07.010"},{"key":"ref24","doi-asserted-by":"crossref","unstructured":"Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., Jagersand, M.: Basnet: Boundary-aware salient object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7479-7489 (2019)","DOI":"10.1109\/CVPR.2019.00766"},{"key":"ref25","doi-asserted-by":"crossref","unstructured":"Stampe, D.M.: Heuristic filtering and reliable calibration methods for video-based pupiltracking systems. Behavior Research Methods, Instruments, & Computers 25, 137-142 (1993)","DOI":"10.3758\/BF03204486"},{"key":"ref26","doi-asserted-by":"crossref","unstructured":"Sugano, Y., Matsushita, Y., Sato, Y.: Appearance-based gaze estimation using visual saliency. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(2), 329-341 (2012)","DOI":"10.1109\/TPAMI.2012.101"},{"key":"ref27","doi-asserted-by":"crossref","unstructured":"Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave, E., Izacard, G., Joulin, A., Synnaeve, G., Verbeek, J., et al.: Resmlp: Feedforward networks for image classification with data-efficient training. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)","DOI":"10.1109\/TPAMI.2022.3206148"},{"key":"ref28","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., Polosukhin, I.: Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems 30 (2017)"},{"key":"ref29","doi-asserted-by":"crossref","unstructured":"Wang, K., Zhao, R., Su, H., Ji, Q.: Generalizing eye tracking with bayesian adversarial learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11907-11916 (2019)","DOI":"10.1109\/CVPR.2019.01218"},{"key":"ref30","doi-asserted-by":"crossref","unstructured":"Wang, S., Ouyang, X., Liu, T., Wang, Q., Shen, D.: Follow my eye: using gaze to supervise computer-aided diagnosis. IEEE Transactions on Medical Imaging 41(7), 1688-1698 (2022)","DOI":"10.1109\/TMI.2022.3146973"},{"key":"ref31","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7794-7803 (2018)","DOI":"10.1109\/CVPR.2018.00813"},{"key":"ref32","doi-asserted-by":"crossref","unstructured":"Wang, Z., Wang, H., Yu, H., Lu, F.: Interaction with gaze, gesture, and speech in a flexibly configurable augmented reality system. IEEE Transactions on Human-Machine Systems 51(5), 524-534 (2021)","DOI":"10.1109\/THMS.2021.3097973"},{"key":"ref33","doi-asserted-by":"crossref","unstructured":"Williams, O., Blake, A., Cipolla, R.: Sparse and semi-supervised visual mapping with the s\u02c6 3gp. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. vol. 1, pp. 230-237. IEEE (2006)","DOI":"10.1109\/CVPR.2006.285"},{"key":"ref34","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: Convolutional block attention module. 
In: Proceedings of the European Conference on Computer Vision. pp. 3-19 (2018)","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref35","doi-asserted-by":"crossref","unstructured":"Xiong, Y., Kim, H.J., Singh, V.: Mixed effects neural networks (menets) with applications to gaze estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7743-7752 (2019)","DOI":"10.1109\/CVPR.2019.00793"},{"key":"ref36","doi-asserted-by":"crossref","unstructured":"Yang, B., Cui, J., Tong, Y., Wang, L., Zha, H.: Recognition of infants\u2019 gaze behaviors and emotions. In: Proceedings of the International Conference on Pattern Recognition. pp. 3204- 3209. IEEE (2018)","DOI":"10.1109\/ICPR.2018.8545766"},{"key":"ref37","doi-asserted-by":"crossref","unstructured":"Zhang, X., Park, S., Beeler, T., Bradley, D., Tang, S., Hilliges, O.: Eth-xgaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In: Proceedings of the European Conference on Computer Vision. pp. 365-381. Springer (2020)","DOI":"10.1007\/978-3-030-58558-7_22"},{"key":"ref38","doi-asserted-by":"crossref","unstructured":"Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: Appearance-based gaze estimation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4511-4520 (2015)","DOI":"10.1109\/CVPR.2015.7299081"},{"key":"ref39","doi-asserted-by":"crossref","unstructured":"Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: It\u2019s written all over your face: Full-face appearance-based gaze estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 51-60 (2017)","DOI":"10.1109\/CVPRW.2017.284"},{"key":"ref40","doi-asserted-by":"crossref","unstructured":"Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: Mpiigaze: Real-world dataset and deep appearance-based gaze estimation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 41(1), 162-175 (2017)","DOI":"10.1109\/TPAMI.2017.2778103"},{"key":"ref41","doi-asserted-by":"crossref","unstructured":"Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6881-6890 (2021)","DOI":"10.1109\/CVPR46437.2021.00681"}],"container-title":["Computer Science and Information Systems"],"original-title":[],"language":"en","deposited":{"date-parts":[[2024,11,18]],"date-time":"2024-11-18T18:44:32Z","timestamp":1731955472000},"score":1,"resource":{"primary":{"URL":"https:\/\/doiserbia.nb.rs\/Article.aspx?ID=1820-02142400020X"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":41,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024]]}},"URL":"https:\/\/doi.org\/10.2298\/csis231116020x","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-3377315\/v1","asserted-by":"object"}]},"ISSN":["1820-0214","2406-1018"],"issn-type":[{"value":"1820-0214","type":"print"},{"value":"2406-1018","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024]]}}}