Dynamic Masking Strategy: An Effective Approach to Enhancing Accurate Human Motion Generation
Abstract
Human motion generation has become an important research direction in computer vision and human motion modeling. Current generation methods typically rely on static or random masking during training; such masking fails to capture dynamic variations in joint movement amplitude and temporal characteristics, resulting in suboptimal accuracy in the generated motions. To address this, we propose a dynamic masking strategy (DMS) based on motion amplitude, which adjusts the masking probability distribution by incorporating both motion amplitude and temporal features. By computing the motion amplitude of each joint and adapting the masking schedule accordingly, the model is directed to focus on key movements during training, improving the quality of the generated motion. Experimental results demonstrate that DMS outperforms traditional methods across multiple evaluation metrics, achieving a 15.3% reduction in FID, a 9.0% reduction in trajectory error, and a 6.3% reduction in location error, validating the effectiveness and superiority of the proposed method.
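The abstract's description of amplitude-driven masking can be made concrete with a short sketch. The snippet below is a minimal illustration, assuming a motion clip stored as a (frames, joints, 3) array; the function name dynamic_mask, the mask_ratio and temperature parameters, and the softmax weighting are hypothetical choices for exposition, not the authors' published implementation.

import numpy as np

def dynamic_mask(motion, mask_ratio=0.4, temperature=1.0, rng=None):
    """Boolean (T, J) mask biased toward high-amplitude joints and frames."""
    rng = rng or np.random.default_rng()
    T, J, _ = motion.shape
    # Per-joint motion amplitude: norm of the frame-to-frame displacement.
    disp = np.linalg.norm(np.diff(motion, axis=0), axis=-1)  # (T-1, J)
    amp = np.concatenate([disp[:1], disp], axis=0)           # pad back to (T, J)
    # A softmax over all joint-frame positions turns amplitudes into a
    # masking probability distribution; temperature controls its sharpness.
    logits = amp.flatten() / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sample the positions to mask without replacement, weighted by amplitude.
    n_mask = int(mask_ratio * T * J)
    idx = rng.choice(T * J, size=n_mask, replace=False, p=probs)
    mask = np.zeros(T * J, dtype=bool)
    mask[idx] = True
    return mask.reshape(T, J)

# Example: mask 40% of joint-frame positions in a synthetic 60-frame, 22-joint clip.
motion = 0.01 * np.random.randn(60, 22, 3).cumsum(axis=0)
mask = dynamic_mask(motion, mask_ratio=0.4)

In this sketch, a large temperature recovers near-uniform random masking, while a small one concentrates the mask on the highest-amplitude joint-frame positions, mirroring the abstract's goal of directing the model toward key movements during training.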
Article Details
Copyright (c) 2025 Liu X, et al.

This work is licensed under a Creative Commons Attribution 4.0 International License.