Trajectory-Aware Motion Generation for Enhanced Naturalness in Interactive Applications
Abstract
Human motion generation is a pivotal task in generative modeling, and trajectory-guided methods have emerged as a prominent approach because they give precise control over the generated motion. However, balancing motion naturalness against trajectory accuracy remains a significant challenge. In this paper, we present a novel method, the Trajectory-Aware Motion Generator (TAMG), designed to address this challenge. TAMG integrates third-order dynamic features, namely position, velocity, and acceleration, to enhance the naturalness of generated motions while maintaining precise trajectory control. We propose a multimodal feature fusion strategy that combines biomechanical features to ensure accurate motion representation, alongside a sparse sampling strategy based on a motion-importance distribution that focuses on key phases of joint motion. Extensive experiments validate the effectiveness of TAMG, demonstrating superior performance in both trajectory accuracy and motion quality compared to existing methods. The approach offers a simple, effective solution for interactive motion generation tasks and advances the state of the art in trajectory-guided motion generation.
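To make the abstract's two core ideas more concrete, the sketch below illustrates (i) third-order dynamic features built from joint positions via finite differences and (ii) sparse keyframe selection driven by a motion-importance score. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 20 Hz frame rate, and the use of acceleration magnitude as the importance distribution are hypothetical stand-ins for whatever TAMG actually uses.

```python
import numpy as np

def dynamic_features(positions, dt=1.0 / 20.0):
    """Stack position, velocity, and acceleration along the feature axis.

    positions: array of shape (T, J, 3) -- T frames, J joints, 3D coordinates.
    Velocity and acceleration are approximated with finite differences;
    the paper's actual feature extraction may differ.
    """
    velocity = np.gradient(positions, dt, axis=0)        # first temporal derivative
    acceleration = np.gradient(velocity, dt, axis=0)     # second temporal derivative
    return np.concatenate([positions, velocity, acceleration], axis=-1)  # (T, J, 9)

def sparse_sample(features, num_keyframes=8):
    """Select frames where a motion-importance score is highest.

    Importance is illustrated here as summed per-joint acceleration magnitude,
    a stand-in for the importance distribution defined in the paper.
    """
    accel = features[..., 6:9]                               # acceleration channels
    importance = np.linalg.norm(accel, axis=-1).sum(axis=-1) # (T,) per-frame score
    keyframes = np.sort(np.argsort(importance)[-num_keyframes:])
    return keyframes, features[keyframes]

# Example: 60 frames of a 22-joint skeleton (e.g., a HumanML3D-style layout).
motion = np.random.randn(60, 22, 3)
feats = dynamic_features(motion)
idx, sampled = sparse_sample(feats)
print(feats.shape, idx.shape, sampled.shape)  # (60, 22, 9) (8,) (8, 22, 9)
```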
Article Details
Copyright (c) 2025 Liu X, et al.

This work is licensed under a Creative Commons Attribution 4.0 International License.