{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:31:13Z","timestamp":1750221073776,"version":"3.41.0"},"reference-count":72,"publisher":"Association for Computing Machinery (ACM)","issue":"2s","license":[{"start":{"date-parts":[[2019,4,30]],"date-time":"2019-04-30T00:00:00Z","timestamp":1556582400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100011002","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61472422,61332016"],"award-info":[{"award-number":["61472422,61332016"]}],"id":[{"id":"10.13039\/501100011002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,4,30]]},"abstract":"<jats:p>Bilinear models are very powerful in multimodal fusion tasks like Visual Question Answering. The predominant bilinear methods can all be seen as a kind of tensor-based decomposition operation that contains a key kernel called \u201ccore tensor.\u201d Current approaches usually focus on reducing the computation complexity by applying low-rank constraint on the core tensor. In this article, we propose a novel bilinear architecture called Block Term Decomposition Pooling (BTDP), which not only maintains the advantages of previous bilinear methods but also conducts sparse bilinear interactions between modalities. Our method is based on Block Term Decompositions theory of tensor, which will result in a sparse and learnable block-diagonal core tensor for multimodal fusion. We prove that using such a block-diagonal core tensor is equivalent to conducting many \u201ctiny\u201d bilinear operations in different feature spaces. 
Thus, introducing sparsity into the bilinear operation can significantly increase the performance of feature fusion and improve VQA models. What is more, our BTDP is very flexible in design. We develop several variants of BTDP and discuss the effects of the diagonal blocks of core tensor. Extensive experiments on two challenging VQA-v1 and VQA-v2 datasets show that our BTDP method outperforms current bilinear models, achieving state-of-the-art performance.<\/jats:p>","DOI":"10.1145\/3282469","type":"journal-article","created":{"date-parts":[[2019,7,3]],"date-time":"2019-07-03T13:47:53Z","timestamp":1562161673000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["BTDP"],"prefix":"10.1145","volume":"15","member":"320","published-online":{"date-parts":[[2019,7,3]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_2_1_2_1","volume-title":"Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705","author":"Andreas Jacob","year":"2016","unstructured":"Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705 (2016)."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.12"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201917)","author":"Hedi","year":"2017","unstructured":"Hedi Ben-younes, R\u00e9mi Cadene, Matthieu Cord, and Nicolas Thome. 2017. Mutan: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201917). 
3."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF02310791"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.5555\/646255.684566"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1179"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3177745"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1137\/060661685"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1137\/070690729"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1137\/070690730"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2501643.2501646"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1044"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969496"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.41"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_2_1_19_1","volume-title":"\u201cexplanatory","author":"Harshman Richard A","year":"1970","unstructured":"Richard A Harshman. 1970. Foundations of the PARAFAC procedure: Models and conditions for an \u201cexplanatory\u201d multimodal factor analysis. UCLA Working Papers in Phonetics 16 (1970)."},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"He K.","key":"e_1_2_1_20_1","unstructured":"K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916). 770--778."},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural Comput. 
(1997) 1735--1780.","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2614132"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2710635"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2043612.2043613"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2011.53"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2487736"},{"key":"e_1_2_1_27_1","volume-title":"A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485","author":"Ilievski Ilija","year":"2016","unstructured":"Ilija Ilievski, Shuicheng Yan, and Jiashi Feng. 2016. A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485 (2016)."},{"volume-title":"Proceedings of the European Conference on Computer Vision. Springer, 727--739","author":"Jabri Allan","key":"e_1_2_1_28_1","unstructured":"Allan Jabri, Armand Joulin, and Laurens van der Maaten. 2016. Revisiting visual question answering baselines. In Proceedings of the European Conference on Computer Vision. Springer, 727--739."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969038"},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201917)","author":"Kim Jin-Hwa","year":"2017","unstructured":"Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. 2017. Hadamard product for low-rank bilinear pooling. In Proceedings of the International Conference on Learning Representations (ICLR\u201917)."},{"volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201914)","author":"Diederik","key":"e_1_2_1_31_1","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. 
In Proceedings of the International Conference on Learning Representations (ICLR\u201914)."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969607"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1137\/07070111X"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999257"},{"key":"e_1_2_1_36_1","first-page":"1995","article-title":"Convolutional networks for images, speech, and time series","volume":"3361","author":"LeCun Yann","year":"1995","unstructured":"Yann LeCun, Yoshua Bengio, et al. 1995. Convolutional networks for images, speech, and time series. Handbook Brain Theory Neural Netw. 3361, 10 (1995), 1995.","journal-title":"Handbook Brain Theory Neural Netw."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2624140"},{"key":"e_1_2_1_38_1","volume-title":"Deep collaborative embedding for social image understanding","author":"Li Zechao","year":"2018","unstructured":"Zechao Li, Jinhui Tang, and Tao Mei. 2018. Deep collaborative embedding for social image understanding. IEEE Trans. Pattern Anal. Mach. Intell. (2018)."},{"volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201914)","author":"Lin Tsung-Yi","key":"e_1_2_1_39_1","unstructured":"Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV\u201914). 
Springer, 740--755."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.170"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3187011"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.5555\/2968826.2969014"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2071396.2071402"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.232"},{"key":"e_1_2_1_45_1","volume-title":"Training recurrent answering units with joint loss minimization for VQA. arXiv preprint arXiv:1606.03647","author":"Noh Hyeonwoo","year":"2016","unstructured":"Hyeonwoo Noh and Bohyung Han. 2016. Training recurrent answering units with joint loss minimization for VQA. arXiv preprint arXiv:1606.03647 (2016)."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2818711"},{"key":"e_1_2_1_47_1","volume-title":"NIPS 2017 Autodiff Workshop.","author":"Paszke Adam","year":"2017","unstructured":"Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop."},{"volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914)","author":"Pennington Jeffrey","key":"e_1_2_1_48_1","unstructured":"Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914). 
1532--1543."},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2487575.2487591"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.5555\/3294996.3295124"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806216"},{"key":"e_1_2_1_52_1","volume-title":"Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034","author":"Simonyan Karen","year":"2013","unstructured":"Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)."},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201915)","author":"Szegedy C.","key":"e_1_2_1_53_1","unstructured":"C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201915). 1--9."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2998574"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1162\/089976600300015349"},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918)","author":"Teney Damien","key":"e_1_2_1_56_1","unstructured":"Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. 2018. Tips and tricks for visual question answering: Learnings from the 2017 challenge. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918)."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF02289464"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3115432"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.5555\/3171642.3171825"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.05.001"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.10"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.283"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.202"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2018.2817340"},{"volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201914)","author":"Matthew","key":"e_1_2_1_65_1","unstructured":"Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV\u201914). Springer, 818--833."},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.542"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.5555\/3304222.3304281"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123364"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.5555\/3172077.3172381"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.5555\/3304222.3304280"},{"key":"e_1_2_1_71_1","volume-title":"Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167","author":"Zhou Bolei","year":"2015","unstructured":"Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. 2015. Simple baseline for visual question answering. 
arXiv preprint arXiv:1512.02167 (2015)."},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.540"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3282469","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3282469","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:57:29Z","timestamp":1750208249000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3282469"}},"subtitle":["Toward Sparse Fusion with Block Term Decomposition Pooling for Visual Question Answering"],"editor":[{"given":"Zhiwei","family":"Fang","sequence":"first","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]},{"given":"Jing","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]},{"given":"Xueliang","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]},{"given":"Qu","family":"Tang","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]},{"given":"Yong","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]},{"given":"Hanqing","family":"Lu","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2019,4,30]]},"references-count":72,"journal-issue":{"issue":"2s","published-print":{"date-parts":[[2019,4,30]]}},"alternative-id":["10.1145\/3282469"],"URL":"https:\/\/doi.org\/10.1145\/3282469","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"su
bject":[],"published":{"date-parts":[[2019,4,30]]},"assertion":[{"value":"2018-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-09-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-07-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}