AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents

CVPR 2024

¹Institute for Artificial Intelligence, Peking University; ²National Key Laboratory of General Artificial Intelligence, BIGAI; ³School of Artificial Intelligence, Beijing University of Posts and Telecommunications

AnySkill generates open-vocabulary, physically plausible, interactive motions.


Traditional approaches to physics-based motion generation, centered on imitation learning and reward shaping, often struggle to adapt to new scenarios. To address this limitation, we propose AnySkill, a novel hierarchical method that learns physically plausible interactions following open-vocabulary instructions. Our approach begins by developing a set of atomic actions through a low-level controller trained with imitation learning. Upon receiving an open-vocabulary textual instruction, AnySkill employs a high-level policy that selects and integrates these atomic actions to maximize the CLIP similarity between the agent's rendered images and the text. An important feature of our method is the use of image-based rewards for the high-level policy, which allows the agent to learn interactions with objects without manual reward engineering. We demonstrate AnySkill's capability to generate realistic and natural motion sequences in response to unseen instructions of varying lengths, making it the first method capable of open-vocabulary physical skill learning for interactive humanoid agents.
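The image-based reward described above can be illustrated with a minimal sketch. It assumes the rendered frame and the instruction have already been encoded into a shared embedding space (as CLIP does) and that the reward is the cosine similarity between the two embeddings. The function names and the toy 3-D vectors are hypothetical stand-ins for real encoder outputs, not the authors' implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero-length embedding carries no signal.
        return 0.0
    return dot / (norm_a * norm_b)


def image_text_reward(image_embedding: Sequence[float],
                      text_embedding: Sequence[float]) -> float:
    """Hypothetical per-step reward: how well the rendered frame matches the text."""
    return cosine_similarity(image_embedding, text_embedding)


# Toy vectors standing in for real image and text encoder outputs.
text = [1.0, 0.0, 0.0]
frame_on_task = [0.9, 0.1, 0.0]   # frame that resembles the instruction
frame_off_task = [0.0, 1.0, 0.0]  # frame unrelated to the instruction

reward_on = image_text_reward(frame_on_task, text)
reward_off = image_text_reward(frame_off_task, text)
```

Under these assumptions, a frame whose embedding points in nearly the same direction as the text embedding earns a higher reward than an unrelated frame, so a high-level policy trained on this signal is pushed toward motions that look like the instruction without any hand-written reward terms.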


AnySkill model


More motions

Jump high

Cha-cha dance

Walk backwards



  title={AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents},
  author={Cui, Jieming and Liu, Tengyu and Liu, Nian and Yang, Yaodong and Zhu, Yixin and Huang, Siyuan},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},