{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:10:53Z","timestamp":1750219853646,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":25,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,5,26]],"date-time":"2023-05-26T00:00:00Z","timestamp":1685059200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,5,26]]},"DOI":"10.1145\/3603781.3603827","type":"proceedings-article","created":{"date-parts":[[2023,7,27]],"date-time":"2023-07-27T18:02:29Z","timestamp":1690480949000},"page":"265-272","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["LanYUAN, a GPT large model using Curriculum Learning and Sparse Attention"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-5520-6000","authenticated-orcid":false,"given":"Gonghai","family":"Zhou","sequence":"first","affiliation":[{"name":"School of Information Science &amp; Engineering, Lanzhou University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6180-4457","authenticated-orcid":false,"given":"Yuhong","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Information Science &amp; Engineering, Lanzhou University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-3812-1252","authenticated-orcid":false,"given":"Rizhen","family":"Hu","sequence":"additional","affiliation":[{"name":"School of Information Science &amp; Engineering, Lanzhou University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-5616-9780","authenticated-orcid":false,"given":"Yang","family":"Zhang","sequence":"additional","affiliation":[{"name":"Supercomputing Center of Lanzhou University, Lanzhou University, China"}]}],"member":"320","published-online":{"date-parts":[[2023,7,27]]},"reference":[{"key":"e_1_3_2_1_1_1","first-page":"6008","volume-title":"Advances in neural information processing systems","author":"Vaswani A.","year":"2017","unstructured":"Vaswani , A. , Shazeer , N. , Parmar , N. , Uszkoreit , J. , Jones , L. , Gomez , A. N. , Kaiser , \u0141., and Polosukhin , I . Attention is all you need . In Advances in neural information processing systems , pp. 5998\u2013 6008 , 2017 . Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, \u0141., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998\u20136008, 2017."},{"key":"e_1_3_2_1_2_1","volume-title":"He Y. Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training[J]. arXiv preprint arXiv:2108.06084","author":"Li C","year":"2021","unstructured":"Li C , Zhang M , He Y. Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training[J]. arXiv preprint arXiv:2108.06084 , 2021 . Li C, Zhang M, He Y. Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training[J]. 
arXiv preprint arXiv:2108.06084, 2021."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553380"},{"key":"e_1_3_2_1_4_1","volume-title":"Improving language understanding by generative pre-training[J]","author":"Radford A","year":"2018","unstructured":"Radford A , Narasimhan K , Salimans T , Improving language understanding by generative pre-training[J] . 2018 . Radford A, Narasimhan K, Salimans T, Improving language understanding by generative pre-training[J]. 2018."},{"key":"e_1_3_2_1_5_1","volume-title":"Scaling Laws for Neural Language Models","author":"Kaplan","year":"2020","unstructured":"J. Kaplan , Yuan. McCandlish, T. Henighan , T. B. Brown , B. Chess , R. Child , Yuan. Gray, A. Radford , J. Wu , and D. Amodei , \u201c Scaling Laws for Neural Language Models ,\u201d arXiv, 2020 . J. Kaplan, Yuan. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, Yuan. Gray, A. Radford, J. Wu, and D. Amodei, \u201cScaling Laws for Neural Language Models,\u201d arXiv, 2020."},{"key":"e_1_3_2_1_6_1","volume-title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","author":"Liu M.","year":"2019","unstructured":"Y. Liu , M. Ott , N. Goyal , J. Du , M. Joshi , D. Chen , O. Levy , M. Lewis , L. Zettlemoyer , and V. Stoyanov , \u201c RoBERTa: A Robustly Optimized BERT Pretraining Approach ,\u201d 2019 . Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, \u201cRoBERTa: A Robustly Optimized BERT Pretraining Approach,\u201d 2019."},{"key":"e_1_3_2_1_7_1","volume-title":"Language Models are Few-Shot Learners","author":"Brown B.","year":"2020","unstructured":"T. B. Brown , B. Mann , N. Ryder , M. Subbiah , J. Kaplan , P. Dhariwal , A. Neelakantan , P. Shyam , G. Sastry , A. Askell , Yuan. Agarwal, A. Herbert-Voss , G. Krueger , T. Henighan , R. Child , A. Ramesh , D. M. Ziegler , J. Wu , C. Winter , C. Hesse , M. Chen , E. Sigler , M. Litwin , Yuan. Gray, B. Chess , J. Clark , C. Berner , Yuan. McCandlish, A. Radford , I. Sutskever , and D. Amodei , \u201c Language Models are Few-Shot Learners ,\u201d 2020 . T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, Yuan. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, Yuan. Gray, B. Chess, J. Clark, C. Berner, Yuan. McCandlish, A. Radford, I. Sutskever, and D. Amodei, \u201cLanguage Models are Few-Shot Learners,\u201d 2020."},{"key":"e_1_3_2_1_8_1","volume-title":"Improving language understanding by generative pretraining","author":"Radford A.","year":"2018","unstructured":"Radford , A. , Narasimhan , K. , Salimans , T. , and Sutskever , I . Improving language understanding by generative pretraining . 2018 a. Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pretraining. 2018a."},{"key":"e_1_3_2_1_9_1","volume-title":"Language models are unsupervised multitask learners","author":"Radford A.","year":"2018","unstructured":"Radford , A. , Wu , J. , Child , R. , Luan , D. , Amodei , D. , and Sutskever , I . Language models are unsupervised multitask learners . 2018 b. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2018b."},{"key":"e_1_3_2_1_10_1","volume-title":"Large-scale pre-trained language model in zero-shot and few-shot learning[J]. 
arXiv preprint arXiv:2110.04725","author":"Wu Yuan X","year":"2021","unstructured":"Wu Yuan , Zhao X , Yu T , Yuan 1. 0 : Large-scale pre-trained language model in zero-shot and few-shot learning[J]. arXiv preprint arXiv:2110.04725 , 2021 . Wu Yuan, Zhao X, Yu T, Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning[J]. arXiv preprint arXiv:2110.04725, 2021."},{"key":"e_1_3_2_1_11_1","volume-title":"Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model[J]. arXiv preprint arXiv:2201.11990","author":"Smith S","year":"2022","unstructured":"Smith S , Patwary M , Norick B , Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model[J]. arXiv preprint arXiv:2201.11990 , 2022 . Smith S, Patwary M, Norick B, Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model[J]. arXiv preprint arXiv:2201.11990, 2022."},{"key":"e_1_3_2_1_12_1","volume-title":"Puri R","author":"Shoeybi M","year":"1909","unstructured":"Shoeybi M , Patwary M , Puri R , Megatron-lm : Training multi-billion parameter language models using model parallelism[J]. arXiv preprint arXiv: 1909 .08053, 2019. Shoeybi M, Patwary M, Puri R, Megatron-lm: Training multi-billion parameter language models using model parallelism[J]. arXiv preprint arXiv:1909.08053, 2019."},{"key":"e_1_3_2_1_13_1","volume-title":"Ruwase O","author":"Rasley J","year":"2020","unstructured":"Rasley J , Rajbhandari S , Ruwase O , Deepspeed : System optimizations enable training deep learning models with over 100 billion parameters[C]\/\/Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining . 2020 : 3505-3506. Rasley J, Rajbhandari S, Ruwase O, Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters[C]\/\/Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 3505-3506."},{"key":"e_1_3_2_1_14_1","volume-title":"Ruwase O","author":"Rajbhandari S","year":"2020","unstructured":"Rajbhandari S , Rasley J , Ruwase O , Zero : Memory optimizations toward training trillion parameter models[C]\/\/SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE , 2020 : 1-16. Rajbhandari S, Rasley J, Ruwase O, Zero: Memory optimizations toward training trillion parameter models[C]\/\/SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020: 1-16."},{"key":"e_1_3_2_1_15_1","volume-title":"Yang Y","author":"Dai Z","year":"2018","unstructured":"Dai Z , Yang Z , Yang Y , Transformer-xl : Language modeling with longer-term dependency[J]. 2018 . Dai Z, Yang Z, Yang Y, Transformer-xl: Language modeling with longer-term dependency[J]. 2018."},{"key":"e_1_3_2_1_16_1","volume-title":"Monotonic chunkwise attention[J]. arXiv preprint arXiv:1712.05382","author":"Chiu C C","year":"2017","unstructured":"Chiu C C , Raffel C. Monotonic chunkwise attention[J]. arXiv preprint arXiv:1712.05382 , 2017 . Chiu C C, Raffel C. Monotonic chunkwise attention[J]. arXiv preprint arXiv:1712.05382, 2017."},{"key":"e_1_3_2_1_17_1","volume-title":"Massive exploration of neural machine translation architectures[J]. arXiv preprint arXiv:1703.03906","author":"Britz D","year":"2017","unstructured":"Britz D , Goldie A , Luong M T , Massive exploration of neural machine translation architectures[J]. 
arXiv preprint arXiv:1703.03906 , 2017 . Britz D, Goldie A, Luong M T, Massive exploration of neural machine translation architectures[J]. arXiv preprint arXiv:1703.03906, 2017."},{"key":"e_1_3_2_1_18_1","volume-title":"Generating long sequences with sparse transformers[J]. arXiv preprint arXiv:1904.10509","author":"Child R","year":"2019","unstructured":"Child R , Gray S , Radford A , Generating long sequences with sparse transformers[J]. arXiv preprint arXiv:1904.10509 , 2019 . Child R, Gray S, Radford A, Generating long sequences with sparse transformers[J]. arXiv preprint arXiv:1904.10509, 2019."},{"key":"e_1_3_2_1_19_1","volume-title":"Measuring the effects of data parallelism on neural network training[J]. arXiv preprint arXiv:1811.03600","author":"Shallue C J","year":"2018","unstructured":"Shallue C J , Lee J , Antognini J , Measuring the effects of data parallelism on neural network training[J]. arXiv preprint arXiv:1811.03600 , 2018 . Shallue C J, Lee J, Antognini J, Measuring the effects of data parallelism on neural network training[J]. arXiv preprint arXiv:1811.03600, 2018."},{"key":"e_1_3_2_1_20_1","volume-title":"Puri R","author":"Shoeybi M","year":"1909","unstructured":"Shoeybi M , Patwary M , Puri R , Megatron-lm : Training multi-billion parameter language models using model parallelism[J]. arXiv preprint arXiv: 1909 .08053, 2019. Shoeybi M, Patwary M, Puri R, Megatron-lm: Training multi-billion parameter language models using model parallelism[J]. arXiv preprint arXiv:1909.08053, 2019."},{"key":"e_1_3_2_1_21_1","volume-title":"Bapna A","author":"Huang Y","year":"2019","unstructured":"Huang Y , Cheng Y , Bapna A , Gpipe : Efficient training of giant neural networks using pipeline parallelism[J]. Advances in neural information processing systems, 2019 , 32. Huang Y, Cheng Y, Bapna A, Gpipe: Efficient training of giant neural networks using pipeline parallelism[J]. Advances in neural information processing systems, 2019, 32."},{"key":"e_1_3_2_1_22_1","volume-title":"Networking, Storage and Analysis","author":"Haidar A","year":"2018","unstructured":"Haidar A , Tomov S , Dongarra J , Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers[C]\/\/SC18: International Conference for High Performance Computing , Networking, Storage and Analysis . IEEE , 2018 : 603-613. Haidar A, Tomov S, Dongarra J, Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers[C]\/\/SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018: 603-613."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.aiopen.2021.08.002"},{"key":"e_1_3_2_1_24_1","volume-title":"SearchNetworking. Accessed","author":"DeepSpeed","year":"2023","unstructured":"DeepSpeed team,\u201d DeepSpeed : Extreme-scale model training for everyone \u201d SearchNetworking. Accessed : Mar 17, 2023 . [Online]. Available:https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/ DeepSpeed team,\u201d DeepSpeed: Extreme-scale model training for everyone\u201d SearchNetworking. Accessed: Mar 17, 2023. [Online]. Available:https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/"},{"key":"e_1_3_2_1_25_1","volume-title":"CLUE: A Chinese language understanding evaluation benchmark[J]. 
arXiv preprint arXiv:2004.05986","author":"Xu L","year":"2020","unstructured":"Xu L , Hu H , Zhang X , CLUE: A Chinese language understanding evaluation benchmark[J]. arXiv preprint arXiv:2004.05986 , 2020 Xu L, Hu H, Zhang X, CLUE: A Chinese language understanding evaluation benchmark[J]. arXiv preprint arXiv:2004.05986, 2020"}],"event":{"name":"CNIOT'23: 2023 4th International Conference on Computing, Networks and Internet of Things","acronym":"CNIOT'23","location":"Xiamen China"},"container-title":["Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3603781.3603827","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3603781.3603827","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:47:10Z","timestamp":1750178830000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3603781.3603827"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,26]]},"references-count":25,"alternative-id":["10.1145\/3603781.3603827","10.1145\/3603781"],"URL":"https:\/\/doi.org\/10.1145\/3603781.3603827","relation":{},"subject":[],"published":{"date-parts":[[2023,5,26]]},"assertion":[{"value":"2023-07-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
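
For reference, the record above has the shape of a standard Crossref REST API "work" payload for this DOI. Below is a minimal Python sketch, assuming network access and the third-party requests package (neither is part of the record itself), showing how such a payload can be retrieved from the public api.crossref.org endpoint and a few of the fields shown above can be read.

import requests

DOI = "10.1145/3603781.3603827"

# Fetch the work record from the public Crossref REST API.
resp = requests.get(f"https://api.crossref.org/works/{DOI}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]  # the "message" object carries the metadata shown above

# Read a few fields present in the record: title, authors, pages, deposited reference count.
print(work["title"][0])
print(", ".join(f'{a.get("given", "")} {a.get("family", "")}'.strip() for a in work.get("author", [])))
print("pages:", work.get("page"))
print("references deposited:", work.get("reference-count"))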