{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,30]],"date-time":"2026-01-30T07:38:10Z","timestamp":1769758690315,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":36,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100002855","name":"Ministry of Science and Technology of the People's Republic of China","doi-asserted-by":"publisher","award":["2022ZD0116309"],"award-info":[{"award-number":["2022ZD0116309"]}],"id":[{"id":"10.13039\/501100002855","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,12,9]]},"DOI":"10.1145\/3743093.3770949","type":"proceedings-article","created":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T08:06:16Z","timestamp":1765008376000},"page":"1-7","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Mixture of Group Experts for Learning Invariant Representations"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4512-3296","authenticated-orcid":false,"given":"Lei","family":"Kang","sequence":"first","affiliation":[{"name":"Beijing Normal University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2956-2846","authenticated-orcid":false,"given":"Jia","family":"Li","sequence":"additional","affiliation":[{"name":"Beijing Normal University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-3777-7916","authenticated-orcid":false,"given":"Mi","family":"Tian","sequence":"additional","affiliation":[{"name":"TAL Education Group, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2587-1702","authenticated-orcid":false,"given":"Hua","family":"Huang","sequence":"additional","affiliation":[{"name":"Beijing Normal University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,12,6]]},"reference":[{"key":"e_1_3_3_2_2_2","doi-asserted-by":"crossref","unstructured":"R\u00f3bert Csord\u00e1s Piotr Pi\u0119kos Kazuki Irie and J\u00fcrgen Schmidhuber. 2024. Switchhead: Accelerating Transformers with mixture-of-experts attention. Advances in Neural Information Processing Systems 37 (2024) 74411\u201374438.","DOI":"10.52202\/079017-2368"},{"key":"e_1_3_3_2_3_2","doi-asserted-by":"crossref","unstructured":"David\u00a0L Donoho and Michael Elad. 2003. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences 100 5 (2003) 2197\u20132202.","DOI":"10.1073\/pnas.0437847100"},{"key":"e_1_3_3_2_4_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, G Heigold, S Gelly, et\u00a0al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_3_2_5_2","doi-asserted-by":"crossref","unstructured":"Yonina\u00a0C Eldar Patrick Kuppinger and Helmut Bolcskei. 2010. Block-sparse signals: Uncertainty relations and efficient recovery. IEEE Transactions on Signal Processing 58 6 (2010) 3042\u20133054.","DOI":"10.1109\/TSP.2010.2044837"},{"key":"e_1_3_3_2_6_2","unstructured":"William Fedus Barret Zoph and Noam Shazeer. 2022. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23 120 (2022) 1\u201339."},{"key":"e_1_3_3_2_7_2","doi-asserted-by":"crossref","unstructured":"Jinyuan Feng Zhiqiang Pu Tianyi Hu Dongmin Li Xiaolin Ai and Huimu Wang. 2025. OMoE: Diversifying mixture of low-rank adaptation by orthogonal finetuning. arXiv:https:\/\/arXiv.org\/abs\/2501.10062 (2025).","DOI":"10.3233\/FAIA251350"},{"key":"e_1_3_3_2_8_2","unstructured":"Yaohua Hu Chong Li Kaiwen Meng Jing Qin and Xiaoqi Yang. 2017. Group sparse optimization via lp q regularization. Journal of Machine Learning Research 18 30 (2017) 1\u201352."},{"key":"e_1_3_3_2_9_2","unstructured":"Changho Hwang Wei Cui Yifan Xiong Ziyue Yang Ze Liu Han Hu Zilong Wang Rafael Salas Jithin Jose Prabhat Ram et\u00a0al. 2023. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems 5 (2023) 269\u2013287."},{"key":"e_1_3_3_2_10_2","doi-asserted-by":"crossref","unstructured":"Aapo Hyv\u00e4rinen and Patrik\u00a0O Hoyer. 2001. A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research 41 18 (2001) 2413\u20132423.","DOI":"10.1016\/S0042-6989(01)00114-6"},{"key":"e_1_3_3_2_11_2","doi-asserted-by":"crossref","unstructured":"Aapo Hyv\u00e4rinen and Urs K\u00f6ster. 2007. Complex cell pooling and the statistics of natural images. Network: Computation in Neural Systems 18 2 (2007) 81\u2013100.","DOI":"10.1080\/09548980701418942"},{"key":"e_1_3_3_2_12_2","unstructured":"Albert\u00a0Q Jiang Alexandre Sablayrolles Antoine Roux Arthur Mensch Blanche Savary Chris Bamford Devendra\u00a0Singh Chaplot Diego de\u00a0las Casas Emma\u00a0Bou Hanna Florian Bressand et\u00a0al. 2024. Mixtral of experts. arXiv:https:\/\/arXiv.org\/abs\/2401.04088 (2024)."},{"key":"e_1_3_3_2_13_2","unstructured":"Jared Kaplan Sam McCandlish Tom Henighan Tom\u00a0B Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeffrey Wu and Dario Amodei. 2020. Scaling laws for neural language models. arXiv:https:\/\/arXiv.org\/abs\/2001.08361 (2020)."},{"key":"e_1_3_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206545"},{"key":"e_1_3_3_2_15_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Komatsuzaki Aran","year":"2023","unstructured":"Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos\u00a0Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. 2023. Sparse Upcycling: Training mixture-of-experts from dense checkpoints. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_3_2_16_2","unstructured":"Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. (2009)."},{"key":"e_1_3_3_2_17_2","unstructured":"Yann Le and Xuan Yang. 2015. Tiny ImageNet visual recognition challenge. CS 231N 7 7 (2015) 3."},{"key":"e_1_3_3_2_18_2","unstructured":"Seung\u00a0Hoon Lee Seunghyun Lee and Byung\u00a0Cheol Song. 2021. Vision Transformer for small-size datasets. arXiv:https:\/\/arXiv.org\/abs\/2112.13492 (2021)."},{"key":"e_1_3_3_2_19_2","volume-title":"Proceedings of International Conference on Learning Representations","author":"Lepikhin Dmitry","year":"2021","unstructured":"Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling giant models with conditional computation and automatic sharding. In Proceedings of International Conference on Learning Representations."},{"key":"e_1_3_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01170"},{"key":"e_1_3_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01167"},{"key":"e_1_3_3_2_23_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Merity Stephen","year":"2017","unstructured":"Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_3_2_24_2","unstructured":"Carlos Riquelme Joan Puigcerver Basil Mustafa Maxim Neumann Rodolphe Jenatton Andr\u00e9 Susano\u00a0Pinto Daniel Keysers and Neil Houlsby. 2021. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems 34 (2021) 8583\u20138595."},{"key":"e_1_3_3_2_25_2","doi-asserted-by":"crossref","unstructured":"Olga Russakovsky Jia Deng Hao Su Jonathan Krause Sanjeev Satheesh Sean Ma Zhiheng Huang Andrej Karpathy Aditya Khosla Michael Bernstein Alexander\u00a0C. Berg and Li Fei-Fei. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (2015) 211\u2013252.","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_3_2_26_2","volume-title":"Proceedings of International Conference on Learning Representations","author":"Shazeer Noam","year":"2017","unstructured":"Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proceedings of International Conference on Learning Representations."},{"key":"e_1_3_3_2_27_2","doi-asserted-by":"crossref","unstructured":"Manxi Sun Wei Liu Jian Luan Pengzhi Gao and Bin Wang. 2024. Mixture of diverse size experts. arXiv:https:\/\/arXiv.org\/abs\/2409.12210 (2024).","DOI":"10.18653\/v1\/2024.emnlp-industry.118"},{"key":"e_1_3_3_2_28_2","doi-asserted-by":"crossref","unstructured":"Rachel\u00a0S Teo and Tan\u00a0M Nguyen. 2024. MomentumSMoE: Integrating momentum into sparse mixture of experts. Advances in Neural Information Processing Systems 37 (2024) 28965\u201329000.","DOI":"10.52202\/079017-0912"},{"key":"e_1_3_3_2_29_2","doi-asserted-by":"crossref","unstructured":"Joel\u00a0A Tropp. 2004. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory 50 10 (2004) 2231\u20132242.","DOI":"10.1109\/TIT.2004.834793"},{"key":"e_1_3_3_2_30_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan\u00a0N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)."},{"key":"e_1_3_3_2_31_2","unstructured":"Mathurin Videau Alessandro Leite Marc Schoenauer and Olivier Teytaud. 2024. Mixture of experts in image classification: What\u2019s the sweet spot?arXiv:https:\/\/arXiv.org\/abs\/2411.18322 (2024)."},{"key":"e_1_3_3_2_32_2","unstructured":"An Wang Xingwu Sun Ruobing Xie Shuaipeng Li Jiaqi Zhu Zhen Yang Pinxue Zhao JN Han Zhanhui Kang Di Wang et\u00a0al. 2024. HMoE: Heterogeneous mixture of experts for language modeling. arXiv:https:\/\/arXiv.org\/abs\/2408.10681 (2024)."},{"key":"e_1_3_3_2_33_2","doi-asserted-by":"crossref","unstructured":"Liwei Wang Yan Zhang and Jufu Feng. 2005. On the Euclidean distance of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 8 (2005) 1334\u20131339.","DOI":"10.1109\/TPAMI.2005.165"},{"key":"e_1_3_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"e_1_3_3_2_35_2","doi-asserted-by":"crossref","unstructured":"John Wright Yi Ma Julien Mairal Guillermo Sapiro Thomas\u00a0S Huang and Shuicheng Yan. 2010. Sparse representation for computer vision and pattern recognition. Proc. IEEE 98 6 (2010) 1031\u20131044.","DOI":"10.1109\/JPROC.2010.2044470"},{"key":"e_1_3_3_2_36_2","unstructured":"Han Xiao Kashif Rasul and Roland Vollgraf. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:https:\/\/arXiv.org\/abs\/1708.07747 (2017)."},{"key":"e_1_3_3_2_37_2","doi-asserted-by":"crossref","unstructured":"Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology 68 1 (2006) 49\u201367.","DOI":"10.1111\/j.1467-9868.2005.00532.x"}],"event":{"name":"MMAsia '25: ACM Multimedia Asia","location":"Kuala Lumpur Malaysia","acronym":"MMAsia '25","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 7th ACM International Conference on Multimedia in Asia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3743093.3770949","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T08:10:50Z","timestamp":1765008650000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3743093.3770949"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,6]]},"references-count":36,"alternative-id":["10.1145\/3743093.3770949","10.1145\/3743093"],"URL":"https:\/\/doi.org\/10.1145\/3743093.3770949","relation":{},"subject":[],"published":{"date-parts":[[2025,12,6]]},"assertion":[{"value":"2025-12-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}