BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

BlockVid represents a major step forward in long-video generation, tackling one of the hardest open problems in the field: producing coherent, high-fidelity, minute-long clips without collapse, drift, or degradation over time. Developed by DAMO Academy, ZIP Lab, and Hupan Lab, BlockVid extends the semi-autoregressive block diffusion paradigm with innovations that directly address KV-cache error accumulation, with the goal of stable, realistic, and compelling long-horizon video synthesis.

Key Highlights

Semantic-Aware Sparse KV Cache: A novel KV-cache mechanism selectively stores only the most meaningful tokens, retrieving semantically aligned historical context instead of blindly accumulating past errors. This dramatically reduces long-horizon drift and preserves subject/background consistency.
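
To make the idea concrete, here is a minimal sketch of the two steps described above: keeping only the most salient tokens per chunk, then retrieving the cached chunks that best match the current one. This is an illustration under stated assumptions, not the BlockVid implementation. The function names (keep_salient, retrieve_context), the salience scores, and the use of cosine similarity over a per-chunk embedding are all hypothetical choices made for this example.

```python
import math

def keep_salient(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order.
    `scores` is an assumed per-token salience value (hypothetical)."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_context(cache, query_embedding, top_k=2):
    """Return the `top_k` cached chunks most similar to the query,
    re-sorted into temporal order. Cosine similarity is an assumption here."""
    ranked = sorted(
        cache,
        key=lambda entry: cosine(entry["embedding"], query_embedding),
        reverse=True,
    )
    return sorted(ranked[:top_k], key=lambda entry: entry["chunk_id"])

# Toy cache: one embedding per past chunk (values are made up).
cache = [
    {"chunk_id": 0, "embedding": [1.0, 0.0, 0.0]},
    {"chunk_id": 1, "embedding": [0.9, 0.1, 0.0]},
    {"chunk_id": 2, "embedding": [0.0, 1.0, 0.0]},
    {"chunk_id": 3, "embedding": [0.0, 0.0, 1.0]},
]
query = [1.0, 0.05, 0.0]
context_ids = [entry["chunk_id"] for entry in retrieve_context(cache, query)]
kept = keep_salient(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], keep=2)
```

In this toy setup, the chunks whose embeddings point in nearly the same direction as the query are retrieved, while unrelated chunks are left out of the context instead of being accumulated.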

Block Forcing + Self Forcing Training Strategy: A new training recipe combines Block Forcing (for chunk-to-chunk semantic alignment) with Self Forcing (for closing the train–test gap), which helps prevent drift, identity morphing, and loss of scene structure as the video grows longer.

Chunk-Level Noise Scheduling & Shuffling: Noise progressively increases across chunks and is locally shuffled at boundaries, smoothing transitions and minimizing abrupt artifacts that commonly plague minute-long generations.
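
A rough sketch of what such a schedule could look like follows. It assigns one noise level per frame that rises linearly from the first chunk to the last, then shuffles the levels inside a small window around each chunk boundary. The linear ramp, the window size, the low/high values, and the function names are assumptions for illustration only; the actual BlockVid schedule may differ.

```python
import random

def chunk_noise_levels(num_chunks, frames_per_chunk, low=0.1, high=0.9):
    """Per-frame noise levels that increase linearly from chunk to chunk.
    The linear ramp and the low/high bounds are assumed, not from the paper."""
    levels = []
    for chunk in range(num_chunks):
        frac = chunk / (num_chunks - 1) if num_chunks > 1 else 0.0
        levels.extend([low + frac * (high - low)] * frames_per_chunk)
    return levels

def shuffle_boundaries(levels, frames_per_chunk, window=2, seed=0):
    """Shuffle noise levels within `window` frames on each side of every
    chunk boundary, leaving frames far from a boundary untouched."""
    rng = random.Random(seed)  # seeded for reproducibility
    out = list(levels)
    for boundary in range(frames_per_chunk, len(out), frames_per_chunk):
        start = max(0, boundary - window)
        end = min(len(out), boundary + window)
        segment = out[start:end]
        rng.shuffle(segment)
        out[start:end] = segment
    return out

levels = chunk_noise_levels(num_chunks=3, frames_per_chunk=4)
shuffled = shuffle_boundaries(levels, frames_per_chunk=4, window=2, seed=0)
```

The shuffle only permutes levels near a boundary, so the overall set of noise levels is unchanged while the step between neighboring chunks is softened into a mix of the two levels.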

LV-Bench: Fine-Grained Long-Video Benchmark: The authors introduce LV-Bench, a dataset of 1,000 minute-long videos with dense 2–5s annotations, plus a new Video Drift Error (VDE) metric suite to quantify long-range temporal consistency (clarity, subject identity, motion, aesthetics, background stability).
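
As an illustration of how a drift-style metric can be computed, the sketch below measures how far a per-chunk quality score moves away from the score of the first chunk. The exact VDE formula is not given in this summary, so the definition used here (a weighted mean of relative deviations from the first chunk) and the name drift_error are assumptions, not the authors' definition. Lower values mean less drift.

```python
def drift_error(chunk_scores, weights=None):
    """Weighted mean relative deviation of later chunks from the first chunk.
    This is an assumed, simplified drift measure, not the official VDE."""
    if len(chunk_scores) < 2:
        raise ValueError("need at least two chunk scores")
    reference = chunk_scores[0]
    if reference == 0:
        raise ValueError("first chunk score must be non-zero")
    later = chunk_scores[1:]
    if weights is None:
        weights = [1.0] * len(later)  # uniform weighting by default
    if len(weights) != len(later):
        raise ValueError("need one weight per later chunk")
    total_weight = sum(weights)
    deviation = sum(
        w * abs(score - reference) / abs(reference)
        for w, score in zip(weights, later)
    )
    return deviation / total_weight

# Made-up per-chunk subject-consistency scores for two hypothetical videos.
stable = drift_error([0.8, 0.8, 0.8])
drifting = drift_error([1.0, 0.9, 0.8])
```

A video whose per-chunk score stays flat yields a drift of zero, while one whose score decays over time yields a positive value that grows with the size of the decay.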

State-of-the-Art Long-Video Coherence: BlockVid significantly outperforms leading baselines like MAGI-1, SkyReels-V2, FramePack, and Self Forcing:

22.2% improvement on VDE-Subject
19.4% improvement on VDE-Clarity
Top scores in subject consistency, background stability, motion smoothness, and image quality on LV-Bench and VBench.

High-Quality Minute-Long Video Generation: BlockVid maintains sharpness, color stability, and structural coherence across 60-second clips, where many models collapse after ~10–20 seconds.

Why It Matters

BlockVid demonstrates that diffusion-based video generation can scale to realistic minute-long horizons with stability, fidelity, and semantic coherence.

This unlocks new potential for:

Long-form storytelling & filmmaking
World models & simulation
Virtual production & advertising
Embodied AI training environments
High-resolution cinematic generation
Real-time creative workflows

BlockVid’s semi-autoregressive design suggests that the future of video AI lies in chunkwise generation with intelligent memory, coherent dynamics, and architecture-level drift control.

Explore More

Paper: arXiv:2511.22973v1
Project Page: https://ziplab.co/BlockVid