{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,24]],"date-time":"2025-09-24T00:15:30Z","timestamp":1758672930416,"version":"3.44.0"},"publisher-location":"California","reference-count":0,"publisher":"International Joint Conferences on Artificial Intelligence Organization","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,9]]},"abstract":"<jats:p>Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability. LLMs excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators.\n\n\n\nWe introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to 1291\u00d7 performance improvement. Even compared with human experts, QiMeng-TensorOp could reach 251% of OpenBLAS on RISC-V CPUs, and 124% of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by 200\u00d7 compared with human experts.<\/jats:p>","DOI":"10.24963\/ijcai.2025\/783","type":"proceedings-article","created":{"date-parts":[[2025,9,19]],"date-time":"2025-09-19T08:10:40Z","timestamp":1758269440000},"page":"7038-7046","source":"Crossref","is-referenced-by-count":0,"title":["QiMeng-TensorOp: One-Line Prompt is Enough for High-Performance Tensor Operator Generation with Hardware Primitives"],"prefix":"10.24963","author":[{"given":"Xuzhi","family":"Zhang","sequence":"first","affiliation":[{"name":"Institute of Software Chinese Academy of Sciences"},{"name":"University of Chinese Academy of Sciences"}]},{"given":"Shaohui","family":"Peng","sequence":"additional","affiliation":[{"name":"Institute of Software Chinese Academy of Sciences"}]},{"given":"Qirui","family":"Zhou","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences"},{"name":"University of Chinese Academy of Sciences"}]},{"given":"Yuanbo","family":"Wen","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences"}]},{"given":"Qi","family":"Guo","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences"},{"name":"University of Chinese Academy of Sciences"}]},{"given":"Ruizhi","family":"Chen","sequence":"additional","affiliation":[{"name":"Institute of Software Chinese Academy of Sciences"}]},{"given":"Xinguo","family":"Zhu","sequence":"additional","affiliation":[{"name":"Institute of Software Chinese Academy of Sciences"},{"name":"University of Chinese Academy of Sciences"}]},{"given":"Weiqiang","family":"Xiong","sequence":"additional","affiliation":[{"name":"Institute of Software Chinese Academy of Sciences"},{"name":"University of Chinese Academy of Sciences"}]},{"given":"Haixin","family":"Chen","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences"},{"name":"University of Chinese Academy of Sciences"}]},{"given":"Congying","family":"Ma","sequence":"additional","affiliation":[{"name":"Institute of Software Chinese Academy of Sciences"},{"name":"Peking University"}]},{"given":"Ke","family":"Gao","sequence":"additional","affiliation":[{"name":"Institute of Software Chinese Academy of Sciences"}]},{"given":"Chen","family":"Zhao","sequence":"additional","affiliation":[{"name":"Institute of Software Chinese Academy of Sciences"}]},{"given":"Yanjun","family":"Wu","sequence":"additional","affiliation":[{"name":"Institute of Software Chinese Academy of Sciences"},{"name":"University of Chinese Academy of Sciences"}]},{"given":"Yunji","family":"Chen","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences"},{"name":"University of Chinese Academy of Sciences"}]},{"given":"Ling","family":"Li","sequence":"additional","affiliation":[{"name":"Institute of Software Chinese Academy of Sciences"},{"name":"University of Chinese Academy of Sciences"}]}],"member":"10584","event":{"number":"34","sponsor":["International Joint Conferences on Artificial Intelligence Organization (IJCAI)"],"acronym":"IJCAI-2025","name":"Thirty-Fourth International Joint Conference on Artificial Intelligence {IJCAI-25}","start":{"date-parts":[[2025,8,16]]},"theme":"Artificial Intelligence","location":"Montreal, Canada","end":{"date-parts":[[2025,8,22]]}},"container-title":["Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence"],"original-title":[],"deposited":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T11:35:08Z","timestamp":1758627308000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.ijcai.org\/proceedings\/2025\/783"}},"subtitle":[],"proceedings-subject":"Artificial Intelligence Research Articles","short-title":[],"issued":{"date-parts":[[2025,9]]},"references-count":0,"URL":"https:\/\/doi.org\/10.24963\/ijcai.2025\/783","relation":{},"subject":[],"published":{"date-parts":[[2025,9]]}}}