In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi-human contextual motion requires holistic reasoning over dynamic relationships among human-human and human-scene interactions. We leverage the power of a large language model (LLM) to digest the contextual complexity within the textual input and convert the task into tangible subproblems, enabling us to generate multi-agent behavior at a scale not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene into a sequence of small events. Each event calls for a well-defined motion involving relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high-level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability.
Generating multi-human contextual motions in a holistic manner imposes prohibitive reasoning complexities. To address this, our framework operates on an event basis, decomposing the entire problem into manageable subproblems. Powered by large language models (LLMs), the action planning module generates a sequence of events in an online manner, accounting for the dynamic relationships among human-human and human-scene interactions. Receiving these events, the motion synthesis module realizes them as 3D motions of the characters.
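As a rough illustration, the event-based loop could be sketched as follows. This is a minimal Python sketch under assumed interfaces; the names (Event, SceneState, planner.plan_next_event, synthesizer.synthesize, generate_scene) are hypothetical placeholders, not our actual implementation.

```python
# Minimal sketch of the event-based generation loop (hypothetical interfaces).
from dataclasses import dataclass, field

@dataclass
class Event:
    actor: str                  # character performing the event, e.g. "A"
    action: str                 # well-defined action label, e.g. "make coffee"
    target: str | None = None   # object or character involved, if any

@dataclass
class SceneState:
    events_so_far: list[Event] = field(default_factory=list)

def generate_scene(planner, synthesizer, scene, story, max_events=20):
    """Online loop: plan one event at a time, then realize it as a 3D motion."""
    state = SceneState()
    motions = []
    for _ in range(max_events):
        # Action planning (LLM-based): next event given the scene, the input
        # storytelling, and the events generated so far.
        event = planner.plan_next_event(scene, story, state.events_so_far)
        if event is None:        # planner signals that the story is complete
            break
        # Motion synthesis: realize the event as a character motion in the scene.
        motions.append(synthesizer.synthesize(event, scene))
        state.events_so_far.append(event)
    return motions
```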
We have three LLM submodules in our action planning module: the scene describer, the narrator, and the event parser. The scene describer generates a textual scene description from the given 3D scene so that our planning modules can understand the necessary context from the spatial arrangement. The narrator generates a sequence of events, which are converted into detailed information by the event parser. Each module is assigned a smaller, well-defined task, keeping the system performant and scalable for multi-human behavior planning.
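The sketch below shows one way the three submodules could be chained. The prompt templates and the generic `llm` callable are illustrative assumptions, not the exact prompts or API used in our system.

```python
# Illustrative chaining of the three LLM submodules (prompts and the `llm`
# callable are placeholder assumptions, not the actual implementation).
import json
from typing import Callable

def describe_scene(llm: Callable[[str], str], scene_graph: str) -> str:
    # Scene describer: turn the given 3D scene (e.g. a serialized scene graph)
    # into a textual description of its spatial arrangement.
    return llm("Describe the spatial layout of this scene:\n" + scene_graph)

def narrate_next_event(llm: Callable[[str], str], scene_text: str,
                       story: str, history: list[str]) -> str:
    # Narrator: propose the next event given the scene description,
    # the input storytelling, and the events generated so far.
    return llm(
        "Scene: " + scene_text
        + "\nStory: " + story
        + "\nPrevious events: " + "; ".join(history)
        + "\nWrite the next event as a single sentence."
    )

def parse_event(llm: Callable[[str], str], event_text: str) -> dict:
    # Event parser: convert the narrated event into structured fields
    # (actor, action, target) that the motion synthesis module can consume.
    prompt = ('Extract a JSON object with keys "actor", "action", "target" '
              "from this event: " + event_text)
    return json.loads(llm(prompt))
```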
Results on House, Office, and Restaurant scenes.
Our framework can create contextually relevant motions of multiple characters in 3D scenes.
Qualitative comparison of our method against the baselines (w/o Event, Object List, Scene Graph). Each example is given the same initial setting and input storytelling, and we summarize the resulting behaviors below.

Example 1:
- Ours: all events are represented properly.
- w/o Event: {C} does not make coffee.
- Object List: {B} is not seated in the living room; {D} is not positioned in the less crowded area; {A} and {D} have overlapping positions.
- Scene Graph: {C} goes to the kitchen instead of the living room to join the conversation; {C} and {D} have overlapping positions.

Example 2:
- Ours: all events are represented properly.
- w/o Event: {A} does not take out a drink first.
- Object List: {A} fails to locate the cabinet in the rest area.
- Scene Graph: {A} fails to locate the cabinet in the rest area.

Example 3:
- Ours: all events are represented properly.
- w/o Event: {C} does not take a phone call before joining {A}; {D} does not dance first before joining {B}; {A, B, C, D} fail to properly sit on chairs.
- Object List: {B} fails to locate the table in the dining area; {D} does not join {B} after dancing.
- Scene Graph: {A, B} fail to properly sit on chairs; {B} fails to locate the table in the dining area; {D} does not dance in the bar area.

Example 4:
- Ours: all events are represented properly.
- w/o Event: all events are represented properly.
- Object List: {A} fails to sit on the chair associated with the desk.
- Scene Graph: all events are represented properly.