AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

Abstract

Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often suffer from mismatches between character settings and voice performance, insufficient self-correction mechanisms, and limited human interactivity. To address these challenges, we propose AuDirector, a self-reflective closed-loop multi-agent framework. First, an Identity-Aware Pre-Production mechanism transforms narrative texts into character profiles and utterance-level emotional instructions, which are used to retrieve suitable voice candidates and guide expressive speech synthesis, so that each voice fits its character and context. Second, a Collaborative Synthesis and Correction module adds a closed-loop self-correction mechanism that systematically audits the generated audio components and regenerates defective ones. Third, a Human-Guided Interactive Refinement module gives users control by interpreting natural language feedback and refining the underlying scripts accordingly. Experiments show that AuDirector outperforms state-of-the-art baselines in structural coherence, emotional expressiveness, and acoustic fidelity.

AuDirector Framework

Overview of the AuDirector framework. (1) Identity-Aware Pre-Production: Generate character profiles and retrieve semantically matched timbres to build dialogue scripts with emotional guidance. (2) Collaborative Synthesis and Correction: Synthesize speech together with synchronized sound effects and background music, use the Critic agent to trigger regeneration of low-quality samples (red arrows), and finally mix the tracks into the final audio. (3) Human-Guided Interactive Refinement: Interpret natural language feedback to update the scripts, then regenerate only the affected fragments.
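The closed-loop correction in stage (2) can be illustrated with a minimal sketch. It is not the AuDirector implementation: the names `Clip`, `synthesize_with_critic`, `fake_synthesize`, and `fake_critic`, as well as the score threshold and retry limit, are assumptions made for illustration. The sketch only shows the control flow of scoring each synthesized line with a critic and regenerating it until it passes or the retry budget runs out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    text: str      # script line the clip was generated from
    audio: str     # placeholder standing in for real audio data
    score: float   # critic score in [0, 1] (hypothetical scale)
    attempt: int   # index of the synthesis attempt that produced this clip


def synthesize_with_critic(
    script: List[str],
    synthesize: Callable[[str, int], str],
    critic: Callable[[str, str], float],
    threshold: float = 0.7,   # assumed acceptance threshold
    max_attempts: int = 3,    # assumed retry budget per line
) -> List[Clip]:
    """Regenerate each line until the critic accepts it or retries run out."""
    clips: List[Clip] = []
    for text in script:
        best = None
        for attempt in range(1, max_attempts + 1):
            audio = synthesize(text, attempt)
            score = critic(text, audio)
            # Keep the best-scoring take seen so far as a fallback.
            if best is None or score > best.score:
                best = Clip(text, audio, score, attempt)
            if score >= threshold:
                break  # critic accepted this take
        clips.append(best)
    return clips


# Hypothetical stand-ins for the TTS backend and the Critic agent.
def fake_synthesize(text: str, attempt: int) -> str:
    return f"{text}#take{attempt}"


def fake_critic(text: str, audio: str) -> float:
    # Pretend quality improves with each take: 0.5, 0.75, 1.0.
    take = int(audio.rsplit("#take", 1)[1])
    return min(1.0, 0.25 + 0.25 * take)


if __name__ == "__main__":
    for clip in synthesize_with_critic(
        ["Watson, the boots!", "Red clay."], fake_synthesize, fake_critic
    ):
        print(clip)
```

With these stand-ins, the first take of every line scores below the threshold, so each line is regenerated once and the second take is kept. A real critic would inspect the audio itself; the same loop structure would apply to sound effects and background music tracks.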

Sample Demonstrations

Radio Drama: Sherlock Holmes and Dr. Watson keep night watch at Baker Street, exposing the charade of a "dead father's" haunting through red clay on boots and gunpowder traces.

Podcast: In an era where Spotify and Apple Music are readily available and almost free, why would someone spend hundreds of dollars on a bulky vinyl record and put up with the hassle of flipping sides?

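Radio Drama: Tang Seng and his three disciples are trapped at a perilous cliff edge, where they meet a mysterious elder who carries them across the river amid laughter and easy conversation.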

Radio Drama: Pianist Su Yu reunites with a former love on a street corner in the old town on a rainy night.

Comparative Study

We evaluate AuDirector across four representative input scenarios, comparing it with two baseline systems, WavJourney and PodAgent (the latter supports podcast generation only), and with an ablation variant, AuDirector (w/o Critic). To compare the orchestration capabilities of each system fairly, we unify the underlying components across all frameworks: every system is driven by the same Large Language Model (LLM) and uses identical audio generation backends, including the text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) models.

Radio Drama: Sherlock Holmes and Dr. Watson keep night watch at Baker Street, exposing the charade of a "dead father's" haunting through red clay on boots and gunpowder traces.

WavJourney

PodAgent

AuDirector (w/o Critic)

AuDirector

Podcast: In an era where Spotify and Apple Music are readily available and almost free, why would someone spend hundreds of dollars on a bulky vinyl record and put up with the hassle of flipping sides?

WavJourney