{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T02:47:16Z","timestamp":1773802036680,"version":"3.50.1"},"reference-count":0,"publisher":"Association for the Advancement of Artificial Intelligence (AAAI)","issue":"14","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["AAAI"],"abstract":"<jats:p>Generating realistic and coordinated 3D human motion for multiple individuals within complex environments remains a significant challenge. Existing text-to-motion methods are often ``blind'' to the physical scene, leading to implausible motions, while scene-conditioned (HSI) approaches demand cumbersome full 3D data and largely neglect multi-person dynamics. To address these limitations, we introduce the VL2Motion paradigm and its embodiment, MMG-VL, a hierarchical framework that generates coordinated multi-person motions from the most accessible inputs: a single 2D image and natural language. MMG-VL first employs a Scene-Aware Intent Planner (SAIP) to interpret the visual context and decompose the user's command into a set of spatially-grounded, multi-person action blueprints. Subsequently, a Coordinated Motion Synthesizer (CMS) translates these blueprints into high-fidelity 3D motion sequences. The synergy between these stages is driven by two novel loss functions: a Spatial-Semantic Grounding Loss  to ensure the planner's output is grounded in visual reality, and a Coordinated Environmental Realism Loss that enforces physical constraints and coherent group dynamics during synthesis. To facilitate this research, we introduce HumanVL, the first large-scale dataset featuring multi-person activities in multi-room scenes, providing aligned images, text, blueprints, 3D motions, and scene geometry. Extensive experiments demonstrate that MMG-VL significantly outperforms existing methods in generating spatially coherent, physically realistic, and coordinated multi-person motions, paving the way for more scalable and intuitive creation of dynamic virtual worlds.<\/jats:p>","DOI":"10.1609\/aaai.v40i14.38156","type":"journal-article","created":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T00:09:31Z","timestamp":1773792571000},"page":"11712-11720","source":"Crossref","is-referenced-by-count":0,"title":["MMG-VL: A Vision-Language Driven Approach for Multi-Person Motion Generation"],"prefix":"10.1609","volume":"40","author":[{"given":"Songyuan","family":"Yang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wanrong","family":"Huang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yinuo","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhang","family":"Ke-Di","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xihuai","family":"He","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shaowu","family":"Yang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Huibin","family":"Tan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"9382","published-online":{"date-parts":[[2026,3,14]]},"container-title":["Proceedings of the AAAI Conference on Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/38156\/42118","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/38156\/42118","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T00:09:31Z","timestamp":1773792571000},"score":1,"resource":{"primary":{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/view\/38156"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,14]]},"references-count":0,"journal-issue":{"issue":"14","published-online":{"date-parts":[[2026,3,17]]}},"URL":"https:\/\/doi.org\/10.1609\/aaai.v40i14.38156","relation":{},"ISSN":["2374-3468","2159-5399"],"issn-type":[{"value":"2374-3468","type":"electronic"},{"value":"2159-5399","type":"print"}],"subject":[],"published":{"date-parts":[[2026,3,14]]}}}