InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing

¹Peking University, ²State Key Laboratory of General Artificial Intelligence, BIGAI, ³Beijing Institute of Technology, ⁴The Chinese University of Hong Kong, Shenzhen
CVPR 2025 Highlight
InteractAnything Teaser

3D human-object interaction synthesis results by InteractAnything. Our method generates diverse, detailed, and novel interactions for open-set 3D objects. Given a simple text description of the goal interaction and any object mesh as input, we synthesize varied, natural HOI results without training on 3D assets. The orange and green boxes in (b) show detailed contact poses from different views.



Abstract

Recent years have seen significant progress in 3D human-aware generation. However, existing methods still struggle to generate novel human-object interactions (HOI) from text, particularly for open-set objects. We identify three main challenges in this task: precise human-object relation reasoning, affordance parsing for arbitrary objects, and synthesis of detailed interaction poses that align with both the description and the object geometry. In this work, we propose a zero-shot 3D HOI generation framework that requires no training on specific datasets, leveraging knowledge from large-scale pre-trained models. Specifically, human-object relations are inferred from large language models (LLMs) to initialize object properties and guide the optimization process. We then utilize a pre-trained 2D image diffusion model to parse unseen objects and extract contact points, avoiding the limitations imposed by existing 3D asset knowledge. The initial human pose is generated by sampling multiple hypotheses through multi-view SDS based on the input text and object geometry. Finally, we introduce a detailed optimization that produces fine-grained, precise, and natural interactions, enforcing realistic 3D contact between the object and the involved body parts, including hands during grasping. This is achieved by distilling human-level feedback from LLMs to capture detailed human-object relations from the text instruction. Extensive experiments validate the effectiveness of our approach compared to prior works, particularly in the fine-grained nature of the interactions and the ability to handle open-set 3D objects.
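To make the pipeline above concrete, the minimal Python sketch below wires the four stages together. It is an illustrative skeleton under our own naming assumptions, not the released implementation: the HOIState container and every function (query_llm_relations, parse_affordance_2d, sample_pose_sds, optimize_contact) are hypothetical placeholders for the LLM relation reasoning, 2D-diffusion affordance parsing, multi-view SDS pose sampling, and contact-aware refinement stages.

    # Hypothetical sketch of the four-stage zero-shot HOI pipeline.
    # All names and data structures are illustrative placeholders.
    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class HOIState:
        text: str                                               # e.g. "a person rides a bicycle"
        object_mesh: str                                         # path to an open-set object mesh
        relations: dict = field(default_factory=dict)            # LLM-inferred relations
        contact_points: List[int] = field(default_factory=list)  # object vertex indices
        body_pose: List[float] = field(default_factory=list)     # body pose parameters


    def query_llm_relations(text: str, object_mesh: str) -> dict:
        """Stage 1: ask an LLM for object scale, orientation, and contacting body parts."""
        return {"scale": 1.0, "up_axis": "y", "body_parts": ["hands"]}


    def parse_affordance_2d(object_mesh: str, relations: dict) -> List[int]:
        """Stage 2: render the mesh, query a 2D diffusion model, lift contact cues to 3D."""
        return [0, 1, 2]  # placeholder vertex indices


    def sample_pose_sds(text: str, relations: dict, n_hypotheses: int = 4) -> List[float]:
        """Stage 3: sample candidate poses with multi-view SDS and keep the best hypothesis."""
        return [0.0] * 63  # placeholder pose parameters


    def optimize_contact(state: HOIState) -> HOIState:
        """Stage 4: refine the pose so selected body parts touch the contact region
        without interpenetration, guided by LLM feedback on the text instruction."""
        return state


    def run_pipeline(text: str, object_mesh: str) -> HOIState:
        state = HOIState(text=text, object_mesh=object_mesh)
        state.relations = query_llm_relations(text, object_mesh)
        state.contact_points = parse_affordance_2d(object_mesh, state.relations)
        state.body_pose = sample_pose_sds(text, state.relations)
        return optimize_contact(state)


    if __name__ == "__main__":
        result = run_pipeline("a person rides a bicycle", "bicycle.obj")
        print(result.relations, len(result.contact_points), len(result.body_pose))

Each stub stands in for one of the pre-trained models described above; only the data flow between stages reflects the method's structure.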


Framework Overview

InteractAnything Framework

Framework of InteractAnything. Given a text description and an open-set object mesh as input, our approach begins by querying an LLM to infer precise human-object relationships, which are used to initialize object properties (Section 3.2). Next, we analyze the contact affordance on the object geometry (Section 3.3). The human pose is synthesized using a pre-trained 2D diffusion model, guided by an SSDS loss and a designed spatial constraint (Section 3.4). Finally, based on the targeted object contact areas and a plausible human pose, we perform expressive HOI optimization to synthesize realistic and contact-accurate 3D human-object interactions (Section 3.5).
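To give a flavor of what the contact-accurate refinement in Section 3.5 might optimize, the snippet below implements a simplified attraction loss that pulls a chosen set of body vertices toward the parsed contact points on the object. The vertex indices, the random stand-in geometry, and the omission of penetration and pose-prior terms are all illustrative assumptions; this is not the paper's actual objective.

    # Hypothetical, simplified contact objective: attract selected body vertices
    # to the object's target contact points. Not the paper's actual loss.
    import torch


    def contact_attraction_loss(body_verts: torch.Tensor,        # (B, Nb, 3) posed body vertices
                                contact_pts: torch.Tensor,       # (B, Nc, 3) target points on object
                                body_contact_idx: torch.Tensor,  # (K,) indices of contacting body verts
                                ) -> torch.Tensor:
        """Mean distance from each selected body vertex to its nearest target contact point."""
        selected = body_verts[:, body_contact_idx, :]   # (B, K, 3)
        dists = torch.cdist(selected, contact_pts)      # (B, K, Nc) pairwise distances
        nearest, _ = dists.min(dim=-1)                  # (B, K) distance to closest object point
        return nearest.mean()


    if __name__ == "__main__":
        torch.manual_seed(0)
        body = torch.randn(1, 100, 3, requires_grad=True)   # stand-in for body-model vertices
        target = torch.rand(1, 20, 3)                        # stand-in for parsed contact region
        idx = torch.arange(10)                               # assumed indices of contacting vertices

        optim = torch.optim.Adam([body], lr=1e-2)
        for step in range(100):
            optim.zero_grad()
            loss = contact_attraction_loss(body, target, idx)
            loss.backward()
            optim.step()
        print(f"final contact loss: {loss.item():.4f}")

A full system would more likely optimize the parameters of a parametric body model rather than raw vertices and combine such a contact term with the SSDS and spatial constraints described above; the toy example only illustrates the attraction idea.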

Comparison Results

Qualitative comparisons with Magic3D, DreamFusion, DreamFusion*, DreamHOI, and our method on the following action-object pairs: hold / baby, lift / backpack, stand on / basketball, lie on / bed, ride / bicycle, hold / car, hold / chair, print / keyboard, hold / knife, ride / motorcycle, and hug / robot.

BibTeX


      @misc{zhang2025interactanythingzeroshothumanobject,
        title={InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing}, 
        author={Jinlu Zhang and Yixin Chen and Zan Wang and Jie Yang and Yizhou Wang and Siyuan Huang},
        year={2025},
        eprint={2505.24315},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2505.24315}, 
      }