The Tencent ARC team recently released AudioStory, a model that couples large language models (LLMs) with an audio generation system to tackle the temporal coherence and compositional reasoning challenges that traditional text-to-audio methods face in long-form narratives. This opens up a new approach to tasks such as video dubbing, audio continuation, and long-form narrative synthesis.
At the core of AudioStory is a unified understanding-and-generation framework that decomposes a complex narrative into ordered temporal subtasks while keeping scene transitions and emotional tone consistent. Two technical highlights stand out: a decoupled bridging mechanism that divides the labor between the LLM and the audio generator, and an end-to-end training scheme that substantially improves the synergy between instruction understanding and audio generation.
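To make that division of labor concrete, here is a minimal sketch of the two-stage flow in plain Python: a stubbed LLM stage decomposes a narrative instruction into timed sub-events that carry bridging embeddings, and a stubbed generator stage renders each segment before the results are concatenated in order. All names here (SubEvent, plan_narrative, generate_segment, synthesize) and the embedding sizes are illustrative assumptions, not the released AudioStory API.

```python
from dataclasses import dataclass
from typing import List

# Illustrative stand-ins only; not the released AudioStory interface.

@dataclass
class SubEvent:
    caption: str            # text describing this audio segment
    duration_s: float       # planned segment length in seconds
    semantic: List[float]   # "semantic" bridge embedding: what to render
    residual: List[float]   # "residual" bridge embedding: fine acoustic detail

def plan_narrative(instruction: str) -> List[SubEvent]:
    """LLM stage (stubbed): split a long narrative instruction into
    ordered, timed sub-events, each carrying bridge embeddings for the
    audio generator. A real system would prompt the LLM here."""
    clauses = [c.strip() for c in instruction.split(",") if c.strip()]
    return [
        SubEvent(caption=c, duration_s=3.0,
                 semantic=[0.0] * 8, residual=[0.0] * 8)
        for c in clauses
    ]

def generate_segment(event: SubEvent, sample_rate: int = 16_000) -> List[float]:
    """Generator stage (stubbed): a real system would condition an audio
    decoder on the bridge embeddings; here we emit silence of the right
    length so the temporal bookkeeping is visible."""
    return [0.0] * int(event.duration_s * sample_rate)

def synthesize(instruction: str) -> List[float]:
    """End-to-end flow: plan sub-events, render each, and concatenate in
    order so the overall narrative stays temporally coherent."""
    waveform: List[float] = []
    for event in plan_narrative(instruction):
        waveform.extend(generate_segment(event))
    return waveform

if __name__ == "__main__":
    audio = synthesize("rain starts, thunder rolls in, a door slams")
    print(f"generated {len(audio)} samples")  # 3 events x 3 s x 16 kHz
```

The point of the decoupling is visible even in this toy: the planner reasons over the whole narrative and fixes the timeline, while the generator only ever sees one well-scoped, conditioned segment at a time.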
To validate the model's performance, the team built the AudioStory-10K benchmark, which covers animated soundscapes and natural sound narratives. In experiments, the model outperforms existing methods on both single-audio and long-form narrative generation tasks, showing stronger instruction following and higher audio quality. The team has released the inference code along with demonstrations, including a "Tom and Jerry" dubbing example that illustrates the model's broad applicability.