Fine-grained Human Motion Understanding with Language Models

June 22, 2026

Thomas Markhorst

Zhi-Yi Lin

Jouh Yeong Chew

Jan van Gemert

Xucong Zhang



Abstract

In this work, we propose FiGMo, an LLM-based model for fine-grained human motion understanding that represents motion as a sequence of skeletal poses, each with an explicit timestamp. Each pose encodes body joint positions and is temporally grounded with timestamp tokens, allowing the model to reason about motion order, duration, and rhythm. To study what supervision is needed for motion-language reasoning, we construct a diverse training mixture spanning pose captioning, pose question answering, motion captioning, and motion question answering. Our ablations show that the primary gains come from the diversity of pose- and motion-level supervision, while staged training provides a smaller additional benefit. Unlike previous work that relies on ground-truth 3D motion capture, our approach supports both 2D and 3D skeletal motion representations through a unified pose encoder, and can optionally incorporate video to provide contextual information. Extensive experiments on BABEL-QA, HuMMan-QA, CompMo, NTU-RGB+D, and QEVD-Coach demonstrate that our method achieves state-of-the-art performance across multiple benchmarks, highlighting the effectiveness of explicit temporal encoding and diverse pose- and motion-level supervision for fine-grained human motion understanding. Notably, even when using only 2D skeletal input, our approach surpasses previous 3D-based methods.
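To make the idea of temporally grounded poses concrete, the following is a minimal illustrative sketch of how timestamp tokens could be interleaved with pose placeholders in an LLM input sequence. It is not taken from the FiGMo implementation: the function names, the `<t=...s>` token format, the `<pose>` placeholder, and the fixed frame rate are all assumptions made for illustration only.

```python
from typing import List


def format_timestamp(t_seconds: float) -> str:
    """Render a time in seconds as a timestamp token (hypothetical format)."""
    return f"<t={t_seconds:.2f}s>"


def interleave_timestamps(
    num_poses: int, fps: float, pose_token: str = "<pose>"
) -> List[str]:
    """Build a token list in which every pose placeholder is preceded by its timestamp.

    Assumes poses are sampled uniformly at `fps` frames per second, so pose i
    occurs at time i / fps. The placeholder stands in for whatever embedding a
    pose encoder would produce for that frame.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    tokens: List[str] = []
    for i in range(num_poses):
        tokens.append(format_timestamp(i / fps))
        tokens.append(pose_token)
    return tokens


if __name__ == "__main__":
    # Three poses sampled at 2 frames per second.
    print(" ".join(interleave_timestamps(num_poses=3, fps=2.0)))
    # prints: <t=0.00s> <pose> <t=0.50s> <pose> <t=1.00s> <pose>
```

Placing each timestamp directly before its pose keeps order and elapsed time readable from the sequence itself, which is the kind of explicit grounding the abstract credits for reasoning about duration and rhythm.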

Type

Journal article

Publication

pre-print

Last updated on June 22, 2026
