Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene

ICCV 2025
Seoul National University, Chung-Ang University

Our framework automatically creates behavioral plans and corresponding motions for multiple characters, taking into account the 3D scene and the behaviors of other characters.

Abstract

In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi-human contextual motion requires holistic reasoning over dynamic relationships among human-human and human-scene interactions. We harness the power of a large language model (LLM) to digest the contextual complexity within textual input and convert the task into tangible subproblems, allowing us to generate multi-agent behavior at a scale not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene as a sequence of small events. Each event calls for a well-defined motion involving the relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high-level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability.
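
To make the spatial guidance step concrete, the following is a hypothetical sketch of how an event's relative description (e.g., "in front of the sofa") could be grounded to a precise coordinate near an anchor object. The scene table, relation handling, and sampling rule are all assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch of spatial guidance: an event is anchored to a scene
# object via a relative description, and a concrete coordinate is sampled
# near that anchor. Names and the sampling rule are assumptions.
import math
import random

SCENE = {"sofa": (2.0, 0.0, 3.0)}  # object name -> (x, y, z) anchor position


def sample_position(anchor: str, relation: str, radius: float = 0.8) -> tuple:
    """Sample a point near the anchor object satisfying a relative relation."""
    ax, ay, az = SCENE[anchor]
    angle = random.uniform(0, 2 * math.pi)
    if relation == "in_front_of":
        angle = 0.0  # assume the object's facing direction is +x
    return (ax + radius * math.cos(angle), ay, az + radius * math.sin(angle))


# "Alice walks to the sofa" -> relative description -> precise coordinate
target = sample_position("sofa", "in_front_of")
print(f"walk target: {target}")
```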

Method

Generating multi-human contextual motions in a holistic manner imposes prohibitive reasoning complexity. To address this, our framework operates on an event basis, decomposing the full problem into subproblems of manageable complexity. Powered by large language models (LLMs), the action planning module generates a sequence of events in an online manner, considering the dynamic relationships among human-human and human-scene interactions. Receiving these events, the motion synthesis module realizes them as 3D motions of the characters.
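
As an illustration of this event-based decomposition, here is a minimal, hypothetical sketch in Python. The `Event` fields, the scripted planner, and the printed motion labels are assumptions for readability; the paper's actual event schema and module interfaces are not specified here.

```python
# A minimal sketch of the event-based loop: the planner emits one event at a
# time, and the motion synthesizer realizes each event before the next one
# is planned. All names and structures here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Event:
    """One small, well-defined unit of behavior in the unfolding story."""
    action: str                       # e.g. "sit_down", "hand_over"
    characters: list[str]             # characters involved in this event
    objects: list[str] = field(default_factory=list)  # relevant scene objects


def plan_next_event(history: list[Event]) -> Event | None:
    """Placeholder for the LLM-backed action planning module."""
    script = [
        Event("walk_to", ["Alice"], ["sofa"]),
        Event("sit_down", ["Alice"], ["sofa"]),
        Event("greet", ["Bob", "Alice"]),
    ]
    return script[len(history)] if len(history) < len(script) else None


def synthesize_motion(event: Event) -> str:
    """Placeholder for the motion synthesis module (returns a label here)."""
    return f"motion({event.action}: {', '.join(event.characters)})"


# Online planning: each event is planned given the history of prior events,
# so later behavior can react to what other characters have already done.
history: list[Event] = []
while (event := plan_next_event(history)) is not None:
    print(synthesize_motion(event))
    history.append(event)
```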

Our action planning module contains three LLM submodules: the scene describer, the narrator, and the event parser. The scene describer generates a textual scene description from the given 3D scene so that the planning modules can understand the necessary context from the spatial arrangement. The narrator generates a sequence of events, which are converted into detailed information by the event parser. Each module is assigned a small, well-defined task so that the system stays performant and scalable for multi-human behavior planning.
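
The roles of the three submodules suggest a simple prompt-chaining pipeline. The sketch below is a hypothetical illustration, assuming a generic `llm()` chat call and invented prompts; only the division into scene describer, narrator, and event parser comes from the text above.

```python
# Sketch of the three LLM submodules as prompt-to-prompt stages (hypothetical
# prompts and a stubbed `llm` call; the actual prompts are not given here).
def llm(prompt: str) -> str:
    """Stand-in for a chat-completion call to any LLM provider."""
    return f"<response to: {prompt[:40]}...>"


def describe_scene(scene_objects: list[str]) -> str:
    # Scene describer: 3D scene -> textual description of spatial arrangement.
    return llm("Describe the spatial arrangement of: " + ", ".join(scene_objects))


def narrate(scene_description: str, history: list[str]) -> str:
    # Narrator: produces the next event as a short natural-language sentence.
    return llm(f"Scene: {scene_description}\nSo far: {history}\nNext event:")


def parse_event(event_sentence: str) -> str:
    # Event parser: converts the sentence into structured, detailed fields.
    return llm("Extract action, characters, objects as JSON from: " + event_sentence)


description = describe_scene(["sofa", "dining table", "bookshelf"])
event_text = narrate(description, history=[])
event_json = parse_event(event_text)
```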

Results

Results on House, Office, and Restaurant scenes.
Our framework can create contextually relevant motions of multiple characters in 3D scenes.

Ablation Study

Ablation results on the House, Office, Restaurant, and MPH11 scenarios.

BibTeX