
Animate Your World with Lifelike Audio-Video Generation

ByteDance Alive Team

Overview

Alive is a unified audio-video generation model that adapts pretrained Text-to-Video (T2V) models to audio-video generation and animation. Built on the MMDiT architecture, it achieves industry-grade performance for lifelike audio-video generation and animation.

🎬

Unified Audio-Video Generation

Simultaneously supports Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) within a single framework.

⚙️

Advanced Architecture

Features TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment.

📊

High-Quality Data Pipeline

Comprehensive audio-video captioning and quality control across millions of training clips.

🏆

SOTA Performance

Consistently outperforms open-source models and matches or surpasses state-of-the-art commercial solutions.

Demo Video

Benchmark Evaluation

Alive-Bench 1.0

We introduce a comprehensive benchmark for joint audio-visual generation that evaluates model performance along six complementary axes: motion quality, visual aesthetics, visual prompt following, audio quality, audio prompt following, and audio-video synchronization, covering 20+ fine-grained dimensions. This design enables diagnostic evaluation: the benchmark pinpoints which capability fails and why. Crucially, it is built around usage-like prompts that closely mirror how end users actually describe desired content. As a result, the benchmark reduces the common evaluation-to-deployment mismatch, in which strong offline metrics fail to translate into perceived quality in real applications.

Comparison with SOTA

We conducted extensive two-round human evaluations to benchmark our model against leading competitors (Veo 3.1, Kling 2.6, Wan 2.6, Sora 2, and LTX-2). Across all metrics, Alive ranks at or near the top, indicating a well-balanced capability profile rather than a single-metric advantage. Alive performs best on audio prompt following and audio-video synchronization, outperforming the other competitors by a notable margin. This indicates a strong advantage in cross-modal understanding and alignment, particularly in faithfully reflecting audio instructions and maintaining tight timing correspondence between audio events and visual content.

Alive Benchmark Results

Introduction of Alive

Alive is a unified audio-video generation model that excels in text-to-video&audio (T2VA), image-to-video&audio (I2VA), text-to-video (T2V), and text-to-audio (T2A) generation. It offers flexible resolutions and aspect ratios, supports arbitrary video length, and is extensible to character-reference audio-video animation.

Joint Audio-Video Modeling

We propose Alive, a joint generation architecture that seamlessly integrates Audio and Video DiTs via an extended "Dual Stream + Single Stream" paradigm. To resolve temporal granularity mismatches, we introduce UniTemp-RoPE and TA-CrossAttn, which map heterogeneous latents into a shared continuous temporal coordinate system, enforcing physical-time alignment for synchronized audio-visual generation.

Joint Audio-Video Modeling Framework
Model      Model Size   M    N    Input Dim.   Output Dim.   Num. of Heads   Head Dim
VideoDiT   12B          16   40   36           16            24              128
AudioDiT   2B           32   –    32           32            24              64
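
The key idea behind UniTemp-RoPE is that rotary position embeddings can be driven by physical time rather than token index, so video and audio latents with different frame rates land on the same temporal axis. Below is a minimal sketch of that idea. The latent rates (8 video latents/s, 25 audio latents/s) are illustrative assumptions; the head dimensions follow the table above, and the paper's exact formulation may differ.

```python
import torch

def latent_times(num_latents: int, latent_fps: float) -> torch.Tensor:
    """Map latent frame indices of one modality onto a shared physical-time axis (seconds)."""
    return torch.arange(num_latents, dtype=torch.float32) / latent_fps

def rope_angles(t_seconds: torch.Tensor, rot_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE frequencies, driven by continuous time instead of integer token index."""
    freqs = base ** (-torch.arange(0, rot_dim, 2, dtype=torch.float32) / rot_dim)
    return torch.outer(t_seconds, freqs)  # (T, rot_dim / 2) rotation angles

# Illustrative latent rates: 8 video latents/s (e.g., 24 fps with 3x temporal VAE
# compression) and 25 audio latents/s. Both index the same time axis, so tokens
# that co-occur in physical time receive matching rotary phases.
video_t = latent_times(num_latents=40, latent_fps=8.0)    # 5 s of video
audio_t = latent_times(num_latents=125, latent_fps=25.0)  # 5 s of audio
video_angles = rope_angles(video_t, rot_dim=128)  # head dim 128 (see table)
audio_angles = rope_angles(audio_t, rot_dim=64)   # head dim 64
```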

Audio-Video Refiner

The proposed cascaded audio-video (AV) refiner builds on a 480p base model to enable 1080p audio-video generation without excessive computational cost. On the video side, low-resolution inputs are refined to high-resolution outputs, effectively mitigating generative artifacts. On the audio side, clean audio latents are fed into a frozen Audio DiT module, preserving the fidelity and audio-video synchronization established by the base model.
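
A minimal sketch of this cascade, assuming hypothetical `video_refiner` and `audio_dit` modules, per-frame latents, and an assumed call signature; the 2.25x spatial scale corresponds to 480p -> 1080p:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVRefiner(nn.Module):
    """Refine low-res video latents to high resolution while routing the
    already-clean audio latents through a frozen Audio DiT, so refinement is
    conditioned on the same audio features the base model produced."""

    def __init__(self, video_refiner: nn.Module, audio_dit: nn.Module):
        super().__init__()
        self.video_refiner = video_refiner
        self.audio_dit = audio_dit
        # Frozen: audio quality and AV sync are inherited from the base model.
        self.audio_dit.requires_grad_(False)

    def forward(self, video_lat: torch.Tensor, audio_lat: torch.Tensor):
        # video_lat: (B*T, C, h, w) per-frame latents from the 480p base model.
        video_up = F.interpolate(video_lat, scale_factor=2.25, mode="bilinear",
                                 align_corners=False)     # 480p -> 1080p in latent space
        audio_feats = self.audio_dit(audio_lat)           # clean latents pass through
        video_hi = self.video_refiner(video_up, audio_feats)  # assumed signature
        return video_hi, audio_lat                        # audio is returned untouched
```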


Comprehensive Audio-Video Data Pipelines

Going beyond conventional visual-quality filtering, our work introduces a comprehensive data pipeline for joint audio-visual generation. It performs dual-quality filtering on both the audio and video modalities, and employs a joint visual + audio keyword labeling system that associates a single visual object with its diverse range of audio events, enabling more sophisticated audio-visual data balancing. Furthermore, we optimize and correct subject-speech correspondence in multi-person and multi-shot scenarios, significantly enhancing character identity consistency and accuracy.
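
As a rough illustration of the dual-quality gate and joint keyword labeling, consider the sketch below; all field names and thresholds are hypothetical stand-ins for the pipeline's actual scoring models and schema.

```python
# Hypothetical thresholds; the real pipeline uses learned quality scorers.
V_AESTHETIC_MIN = 0.6
A_QUALITY_MIN = 0.7

def passes_dual_quality(clip: dict) -> bool:
    """Dual-quality gate: a clip survives only if BOTH modalities pass."""
    video_ok = clip["video_aesthetic"] >= V_AESTHETIC_MIN and not clip["has_artifacts"]
    audio_ok = clip["audio_quality"] >= A_QUALITY_MIN and not clip["is_silent"]
    return video_ok and audio_ok

def joint_keywords(clip: dict) -> list[str]:
    """Joint labeling: pair each visual object with each of its audio events,
    e.g. dog:bark and dog:whine, so balancing can operate on (object, sound)
    pairs instead of visual categories alone."""
    return [f"{obj}:{event}"
            for obj, events in clip["object_audio_events"].items()
            for event in events]
```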


Role-Playing Animation

We introduce a cross-pair pipeline and a unified-editing-based reference augmentation pipeline to robustly decouple identity from static appearance, effectively mitigating copy-paste bias. Furthermore, we develop a multi-reference conditioning mechanism with a dedicated temporal offset and a dual-conditioning CFG strategy, enabling the model to treat reference images as persistent identity anchors rather than temporal frames, thus achieving superior identity consistency and motion dynamics.
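
To make the temporal-offset idea concrete, here is a minimal sketch under the assumption that reference tokens share the UniTemp-RoPE time axis; the offset value, latent rate, and token counts are illustrative assumptions, not the paper's settings.

```python
import torch

REF_OFFSET_SECONDS = -10.0  # assumed: park references at a time no frame occupies
LATENT_FPS = 8.0            # assumed video latent rate

num_ref_tokens, num_frames = 2, 40
ref_times = torch.full((num_ref_tokens,), REF_OFFSET_SECONDS)
frame_times = torch.arange(num_frames, dtype=torch.float32) / LATENT_FPS
# Positions fed to the shared RoPE: references sit far outside the clip's
# timeline, so the model reads them as identity anchors, not adjacent frames.
rope_times = torch.cat([ref_times, frame_times])
```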

Training Recipe

The Importance of Audio Training: The quality of the initial AudioDiT pre-training (e.g., tone authenticity, pronunciation accuracy, emotional consistency) sets the upper bound for audio performance in joint generation. Joint training primarily improves audio-visual synchronization and has limited impact on fundamental audio quality, so inadequate audio pre-training cannot be meaningfully recovered during subsequent joint training.

Audio Sensitivity and Forgetting: The audio branch is highly sensitive to shifts in the training data distribution and adapts quickly, often catastrophically forgetting previously learned robust audio features. To address this, we apply asymmetric learning rates to prevent audio quality degradation during joint training.
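
A minimal sketch of the asymmetric learning rates, assuming the joint model exposes `video_dit` and `audio_dit` submodules (hypothetical names) and using illustrative values:

```python
import torch
import torch.nn as nn

class JointAVModel(nn.Module):
    """Stand-in modules; the real video_dit / audio_dit would be full DiTs."""
    def __init__(self):
        super().__init__()
        self.video_dit = nn.Linear(36, 16)  # dims mirror the table above
        self.audio_dit = nn.Linear(32, 32)

model = JointAVModel()
optimizer = torch.optim.AdamW(
    [
        {"params": model.video_dit.parameters(), "lr": 1e-4},  # assumed value
        # Roughly an order of magnitude smaller on the audio branch, which
        # otherwise adapts too fast and forgets pretrained audio features.
        {"params": model.audio_dit.parameters(), "lr": 1e-5},  # assumed value
    ],
    weight_decay=0.01,
)
```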


Inference Optimization

The Audio DiT and Video DiT are each guided by two distinct conditions: the text prompt and the cross-attention signal from the other modality. This mutual cross-attention signal steers the model toward audio-video synchronization. To exploit it effectively, we adopt a multi-condition control scheme that treats the text prompt (positive c_pos / negative c_neg) and the mutual cross-attention signal (c_mutual) as separate, controllable conditions for guidance.
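
One common way to realize such multi-condition guidance is the nested CFG pattern below. This is a hedged sketch, not necessarily the paper's exact weighting; `denoise` is a hypothetical wrapper around the joint model that can mask the mutual cross-attention signal, and the scales are illustrative defaults.

```python
import torch

def multi_condition_guidance(denoise, x_t, t, c_pos, c_neg,
                             s_text: float = 7.5, s_mutual: float = 2.0) -> torch.Tensor:
    """`denoise(x, t, text=..., use_mutual=...)` is a hypothetical wrapper:
    use_mutual=False masks the cross-modal attention so the branch is guided
    by the text prompt alone."""
    eps_uncond = denoise(x_t, t, text=c_neg, use_mutual=False)
    eps_text = denoise(x_t, t, text=c_pos, use_mutual=False)
    eps_full = denoise(x_t, t, text=c_pos, use_mutual=True)
    # Text guidance steers semantic content; mutual guidance strengthens
    # audio-video synchronization on top of the text-conditioned prediction.
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_mutual * (eps_full - eps_text))
```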


More Examples

Audio Video Synchronization

Generates diverse human voices and sound effects with natural lip-sync alignment. The audio is clear, with a stable spatial presence, and stays tightly synchronized with visual rhythm and emotional changes for coherent storytelling.

Motion Quality

Excels in high-motion scenarios with dynamic camera movements including pans, tilts, and tracking shots. Maintains smooth temporal coherence and physical plausibility even during rapid motion, with synchronized audio that matches the visual dynamics.

Photorealistic Quality

Delivers cinematic-level realism with natural skin textures, nuanced facial expressions, and lifelike movements. Every frame captures authentic human presence with convincing lighting and physical accuracy, making virtual content indistinguishable from reality.

Unlimited Creativity

Explore curated examples to inspire your next creation. From realistic scenes to animated styles, from solo performances to complex multi-character interactions, discover the endless possibilities of AI-powered audio-video generation.

Citation

If you find our work useful for your research, please consider citing:

@article{guo2026Alive,
  title={Alive: Animate Your World with Lifelike Audio-Video Generation},
  author={Ying Guo and Qijun Gan and Yifu Zhang and Jinlai Liu and Yifei Hu and Pan Xie and Dongjun Qian and Yu Zhang and Ruiqi Li and Yuqi Zhang and Ruibiao Lu and Xiaofeng Mei and Bo Han and Xiang Yin and Bingyue Peng and Zehuan Yuan},
  journal={arXiv preprint arXiv:2602.08682},
  year={2026}
}

Ethics Concerns: All human images used in our demonstrations are either copyrighted or AI-generated, and are intended solely to showcase the capabilities of this research. Please contact us if there are any concerns, and we will remove the relevant content promptly.