In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi-human contextual motion requires holistic reasoning over dynamic relationships among human-human and human-scene interactions. We leverage the power of a large language model (LLM) to digest the contextual complexity within the textual input and convert the task into tangible subproblems, enabling us to generate multi-agent behavior at a scale not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene into a sequence of small events. Each event calls for a well-defined motion involving relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high-level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability.
Generating multi-human contextual motions in a holistic manner imposes prohibitive reasoning complexities. To address this, our framework operates on an event basis, decomposing the entire problem into manageable subproblems. Powered by large language models (LLMs), the action planning module generates a sequence of events in an online manner, accounting for the dynamic relationships among human-human and human-scene interactions. Receiving these events, the motion synthesis module realizes them as 3D motions of the characters.
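As a rough illustration, the event-based loop could be sketched as follows. This is a minimal Python sketch under assumed interfaces; the names (Event, SceneState, planner.plan_next_event, synthesizer.synthesize, generate_scene) are hypothetical placeholders, not our actual implementation.

```python
# Minimal sketch of the event-based generation loop (hypothetical interfaces).
from dataclasses import dataclass, field

@dataclass
class Event:
    actor: str                  # character performing the event, e.g. "A"
    action: str                 # well-defined action label, e.g. "make coffee"
    target: str | None = None   # object or character involved, if any

@dataclass
class SceneState:
    events_so_far: list[Event] = field(default_factory=list)

def generate_scene(planner, synthesizer, scene, story, max_events=20):
    """Online loop: plan one event at a time, then realize it as a 3D motion."""
    state = SceneState()
    motions = []
    for _ in range(max_events):
        # Action planning (LLM-based): next event given the scene, the input
        # storytelling, and the events generated so far.
        event = planner.plan_next_event(scene, story, state.events_so_far)
        if event is None:        # planner signals that the story is complete
            break
        # Motion synthesis: realize the event as a character motion in the scene.
        motions.append(synthesizer.synthesize(event, scene))
        state.events_so_far.append(event)
    return motions
```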
We have three LLM submodules in our action planning module: the scene describer, the narrator, and the event parser. The scene describer generates a textual scene description from the given 3D scene so that our planning modules can understand the necessary context from the spatial arrangement. The narrator generates a sequence of events, which are converted into detailed information by the event parser. Each module is assigned a smaller, well-defined task, keeping the system performant and scalable for multi-human behavior planning.
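The sketch below shows one way the three submodules could be chained. The prompt templates and the generic `llm` callable are illustrative assumptions, not the exact prompts or API used in our system.

```python
# Illustrative chaining of the three LLM submodules (prompts and the `llm`
# callable are placeholder assumptions, not the actual implementation).
import json
from typing import Callable

def describe_scene(llm: Callable[[str], str], scene_graph: str) -> str:
    # Scene describer: turn the given 3D scene (e.g. a serialized scene graph)
    # into a textual description of its spatial arrangement.
    return llm("Describe the spatial layout of this scene:\n" + scene_graph)

def narrate_next_event(llm: Callable[[str], str], scene_text: str,
                       story: str, history: list[str]) -> str:
    # Narrator: propose the next event given the scene description,
    # the input storytelling, and the events generated so far.
    return llm(
        "Scene: " + scene_text
        + "\nStory: " + story
        + "\nPrevious events: " + "; ".join(history)
        + "\nWrite the next event as a single sentence."
    )

def parse_event(llm: Callable[[str], str], event_text: str) -> dict:
    # Event parser: convert the narrated event into structured fields
    # (actor, action, target) that the motion synthesis module can consume.
    prompt = ('Extract a JSON object with keys "actor", "action", "target" '
              "from this event: " + event_text)
    return json.loads(llm(prompt))
```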
Results on House, Office, and Restaurant scenes.
Our framework can create contextually relevant motions of multiple characters in 3D scenes.
Qualitative comparison of our method against the baselines (w/o Event, Object List, Scene Graph). Each example is given the same initial setting and input storytelling, and we summarize the resulting behaviors below.

Example 1:
- Ours: all events are represented properly.
- w/o Event: {C} does not make coffee.
- Object List: {B} is not seated in the living room; {D} is not positioned in the less crowded area; {A} and {D} have overlapping positions.
- Scene Graph: {C} goes to the kitchen instead of the living room to join the conversation; {C} and {D} have overlapping positions.

Example 2:
- Ours: all events are represented properly.
- w/o Event: {A} does not take out a drink first.
- Object List: {A} fails to locate the cabinet in the rest area.
- Scene Graph: {A} fails to locate the cabinet in the rest area.

Example 3:
- Ours: all events are represented properly.
- w/o Event: {C} does not take a phone call before joining {A}; {D} does not dance first before joining {B}; {A, B, C, D} fail to properly sit on chairs.
- Object List: {B} fails to locate the table in the dining area; {D} does not join {B} after dancing.
- Scene Graph: {A, B} fail to properly sit on chairs; {B} fails to locate the table in the dining area; {D} does not dance in the bar area.

Example 4:
- Ours: all events are represented properly.
- w/o Event: all events are represented properly.
- Object List: {A} fails to sit on the chair associated with the desk.
- Scene Graph: all events are represented properly.