
Samsung Researchers Introduced ANSE (Active Noise Selection for Generation): A Model-Aware Framework for Improving Text-to-Video Diffusion Models through Attention-Based Uncertainty Estimation

Video generation models have become a core technology for creating dynamic content by transforming text prompts into high-quality video sequences. Diffusion models, in particular, have established themselves as a leading approach for this task. These models work by starting from random noise and iteratively refining it into realistic video frames. Text-to-video (T2V) models extend this capability by incorporating temporal elements and aligning generated content with textual prompts, producing videos that are both visually compelling and semantically accurate.
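To make the sampling process concrete, here is a minimal schematic of a reverse-diffusion loop in Python. The denoiser network, latent shape, and step count are placeholders, not details of any specific T2V model:

```python
import torch

# Schematic reverse-diffusion sampling: start from pure Gaussian noise
# and repeatedly apply a learned denoiser conditioned on the prompt.
# `denoiser` is a placeholder for a real T2V network; the latent shape
# and step count below are illustrative assumptions.
def sample(denoiser, prompt_emb: torch.Tensor,
           steps: int = 50, shape=(1, 16, 4, 60, 90)) -> torch.Tensor:
    z = torch.randn(shape)            # t = T: pure noise
    for t in reversed(range(steps)):  # refine step by step toward t = 0
        z = denoiser(z, t, prompt_emb)
    return z  # a decoder would turn these latents into video frames
```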

Despite advancements in architecture design, a significant challenge remains: ensuring consistent, high-quality video generation across different runs when the only change is the initial random noise seed. This challenge has highlighted the need for smarter, model-aware noise selection strategies to avoid unpredictable outputs and wasted computational resources.

The Core Problem

The core problem lies in how diffusion models initialize their generation process from Gaussian noise. The specific noise seed used can drastically impact the final video quality, temporal coherence, and prompt fidelity. Current approaches often attempt to address this problem with handcrafted noise priors or frequency-based adjustments. However, these methods can be computationally expensive and may not effectively leverage the model’s internal attention signals, necessitating a more principled, model-aware method for guiding noise selection.
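The seed dependence is easy to see in code. In the sketch below (shapes are illustrative, not taken from any specific model), the seed fully determines the initial latent noise, and therefore the entire denoising trajectory:

```python
import torch

# The whole sampling trajectory is fixed by the initial noise draw:
# same prompt, different seed -> different video. Shape is illustrative.
def initial_latents(seed: int, shape=(1, 16, 4, 60, 90)) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(shape, generator=gen)

print(torch.equal(initial_latents(0), initial_latents(0)))  # True: reproducible
print(torch.equal(initial_latents(0), initial_latents(1)))  # False: a new start
```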

Introducing ANSE

The research team from Samsung Research introduced ANSE (Active Noise Selection for Generation), a framework that uses internal model signals to guide noise seed selection during video generation. At the core of ANSE is BANSA (Bayesian Active Noise Selection via Attention), a novel acquisition function that quantifies the consistency and confidence of the model’s attention maps under stochastic perturbations.
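Conceptually, ANSE turns noise initialization into a selection problem: draw several candidate noises, score each with the acquisition function, and keep the best one. The sketch below illustrates that loop; the latent shape, candidate count, and `score_fn` hook are assumptions for illustration, with a BANSA-style scorer sketched in the next section:

```python
import torch

LATENT_SHAPE = (1, 16, 4, 60, 90)  # illustrative, not CogVideoX's actual shape

def select_noise(prompt: str, score_fn, num_candidates: int = 10) -> torch.Tensor:
    """Pick the candidate noise with the lowest acquisition score.

    `score_fn(prompt, noise)` stands in for a BANSA-style acquisition
    function evaluated over a few early denoising steps (see below);
    the candidate count here is an assumption, not the paper's setting.
    """
    candidates = [
        torch.randn(LATENT_SHAPE, generator=torch.Generator().manual_seed(s))
        for s in range(num_candidates)
    ]
    scores = [float(score_fn(prompt, z)) for z in candidates]
    return candidates[scores.index(min(scores))]
```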

How BANSA Works

BANSA evaluates entropy in the attention maps generated during early denoising steps. The researchers found that attention from a small subset of layers correlates well with full-layer uncertainty estimates, which substantially reduces computational overhead. The BANSA score, computed by comparing the average entropy of individual attention maps with the entropy of their mean, ranks candidate noise seeds; the seed with the lowest score is selected to generate the final video.
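A minimal sketch of this kind of score is below. It follows the BALD-style form suggested by the description (entropy of the mean map minus mean entropy of the individual maps); treat it as an illustration of the idea rather than the paper's exact formulation:

```python
import torch

def entropy(p: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # Shannon entropy of probability rows along the last axis.
    return -(p * (p + eps).log()).sum(dim=-1)

def bansa_score(attn_maps: torch.Tensor) -> torch.Tensor:
    """BALD-style disagreement over stochastically perturbed attention.

    attn_maps: (num_samples, queries, keys) softmax attention rows
    collected from early denoising steps under stochastic perturbations.
    Low values mean the attention is consistent and confident. This is
    a sketch of the idea, not the paper's exact formulation.
    """
    h_of_mean = entropy(attn_maps.mean(dim=0)).mean()  # entropy of the mean map
    mean_of_h = entropy(attn_maps).mean()              # average per-sample entropy
    return h_of_mean - mean_of_h

# Example: score 8 stochastic attention samples of a 16-query, 64-key block.
maps = torch.softmax(torch.randn(8, 16, 64), dim=-1)
print(bansa_score(maps))  # seeds with lower scores are preferred
```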

Performance Metrics

On the CogVideoX-2B model, the total VBench score improved from 81.03 to 81.66 (+0.63), with gains of +0.48 in quality score and +1.23 in semantic alignment. On the larger CogVideoX-5B model, ANSE increased the total VBench score from 81.52 to 81.71 (+0.25), with a +0.17 gain in quality and +0.60 gain in semantic alignment. Importantly, these improvements were achieved with only an 8.68% increase in inference time for CogVideoX-2B and 13.78% for CogVideoX-5B, in contrast to previous methods that required significantly higher increases in inference time.

Advantages of ANSE

  • ANSE improves total VBench scores for video generation: from 81.03 to 81.66 on CogVideoX-2B and from 81.52 to 81.71 on CogVideoX-5B.
  • Quality and semantic alignment gains of +0.48 and +1.23 for CogVideoX-2B; +0.17 and +0.60 for CogVideoX-5B.
  • Modest inference time increases: +8.68% for CogVideoX-2B and +13.78% for CogVideoX-5B.
  • BANSA scores outperformed random and entropy-based methods for noise selection.
  • Efficient layer selection strategy reduces computational load while maintaining performance.

Conclusion

In summary, the research introduced a model-aware noise selection framework that leverages internal attention signals to tackle the challenge of unpredictable video generation in diffusion models. By utilizing BANSA to quantify uncertainty and selecting noise seeds that minimize this uncertainty, the researchers provided a principled, efficient method for enhancing video quality and semantic alignment in text-to-video models.

For further details, check out the Paper and Project Page. All credit for this research goes to the researchers of this project.