Dynamic Masking Strategy: An Effective Approach to Enhancing Accurate Human Motion Generation


Xuan Liu
Zhiyang Zhang
Xiangyu Qu
Shaojun Yuan
Yidian Liu
Chaomurilige
Zheng Liu
Shan Jiang

Abstract

Human motion generation has become an important research direction in computer vision and human motion modelling. Current motion generation methods typically rely on static or random masking during training; such masking fails to capture dynamic variations in joint movement amplitude and temporal characteristics, resulting in suboptimal accuracy in the generated motions. To address this, we propose a dynamic masking strategy (DMS) based on motion amplitude, which dynamically adjusts the mask probability distribution by incorporating both motion amplitude and temporal features. By computing the motion amplitude of each joint and adapting the masking schedule accordingly, the model is directed to focus on key movements during training, improving the quality of motion generation. Experimental results demonstrate that DMS outperforms traditional methods across multiple evaluation metrics, achieving a 15.3% reduction in FID, a 9.0% reduction in trajectory error, and a 6.3% reduction in location error, validating the effectiveness and superiority of the proposed method.
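The abstract's core idea, a masking distribution that scales with per-joint motion amplitude, can be sketched in a few lines. The NumPy code below is a hypothetical reading rather than the paper's implementation: it measures amplitude as frame-to-frame joint displacement, converts amplitudes into masking probabilities with a temperature-scaled softmax, and samples a fixed masking budget so that high-amplitude "key movements" are masked, and therefore reconstructed, more often during training.

import numpy as np

def dynamic_mask(motion, mask_ratio=0.4, temperature=1.0, rng=None):
    # motion: (T, J, 3) array of T frames, J joints, 3D joint positions.
    # Returns a (T, J) boolean mask; True marks entries hidden from the model.
    rng = rng or np.random.default_rng()
    T, J, _ = motion.shape
    # Per-joint amplitude: L2 displacement between consecutive frames.
    disp = np.linalg.norm(np.diff(motion, axis=0), axis=-1)  # (T-1, J)
    amp = np.vstack([disp[:1], disp])                        # pad back to (T, J)
    # Temperature-scaled softmax turns amplitudes into masking probabilities,
    # so joints and frames with larger movements are masked more often.
    logits = amp.ravel() / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sample a fixed budget of (frame, joint) entries without replacement.
    n_mask = int(mask_ratio * T * J)
    idx = rng.choice(T * J, size=n_mask, replace=False, p=probs)
    mask = np.zeros(T * J, dtype=bool)
    mask[idx] = True
    return mask.reshape(T, J)

Names such as dynamic_mask, mask_ratio, and temperature are illustrative only; the paper's formulation may weight temporal features differently, for example by favouring sustained high-velocity segments over individual frames.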

Article Details

Liu, X., Zhang, Z., Qu, X., Yuan, S., Liu, Y., Chaomurilige, … Jiang, S. (2025). Dynamic Masking Strategy: An Effective Approach to Enhancing Accurate Human Motion Generation. Journal of Artificial Intelligence Research and Innovation, 075–084. https://doi.org/10.29328/journal.jairi.1001009

Copyright (c) 2025 Liu X, et al.

This work is licensed under a Creative Commons Attribution 4.0 International License.
