NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

NormalCrafter introduces a novel approach for surface normal estimation in videos, leveraging diffusion priors to achieve high spatial fidelity and temporal consistency over arbitrary-length sequences.

Key Highlights:

Video Diffusion Model Repurposing – Adapts Stable Video Diffusion (SVD) to predict normal maps rather than generate RGB frames, while retaining the model's temporal structure.
Semantic Feature Regularization (SFR) – Aligns intermediate diffusion features with DINO semantic embeddings, enhancing fine-grained geometric detail without inference overhead.
Two-Stage Training Protocol – Trains full U-Net in latent space for long-term temporal modeling, followed by spatial fine-tuning in pixel space for high-resolution normal accuracy.
Fine-Tuned VAE Decoder – Improves normal map reconstruction quality by adapting the VAE decoder, reducing angular errors and boosting PSNR during training.
Zero-Shot Generalization – Achieves strong results on the static-image benchmarks NYUv2 and iBims-1 and on the video benchmarks ScanNet and Sintel, without task-specific fine-tuning.
Superior Quantitative Results – Outperforms baselines (DSINE, StableNormal, Marigold-E2E-FT) on Sintel videos, with up to 1.6° lower mean angular error and a 3.1% gain in the share of pixels with angular error under 30°.
Temporal Stability – Produces smoother y-t slices (a fixed image column plotted over time) than prior methods, eliminating flickering artifacts under large motion and in dynamic scenes.
Efficient Semantic Enhancement – SFR operates only during training, adding no inference latency or memory cost.
Flexible Single-Image Compatibility – Capable of single-frame normal estimation by setting frame length to one, maintaining competitive static accuracy.
Extensive Validation – Evaluated on DAVIS, Sora-generated videos, and the NYUv2, iBims-1, ScanNet, and Sintel benchmarks, confirming robustness to diverse environments.
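
The two metrics quoted in the highlights, mean angular error and the share of pixels with angular error under 30°, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `angular_error_deg` and `normal_metrics`, the flat list-of-tuples input, and the toy values are all assumptions made for illustration, and a real evaluation would typically run vectorized over full normal maps with valid-pixel masks.

```python
import math


def angular_error_deg(pred, gt):
    """Angle in degrees between two 3D normal vectors (need not be unit length)."""
    dot = sum(p * g for p, g in zip(pred, gt))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(g * g for g in gt))
    if norm == 0.0:
        raise ValueError("normals must be non-zero")
    # Clamp to guard against rounding that lands slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def normal_metrics(pred_normals, gt_normals, threshold_deg=30.0):
    """Return (mean angular error in degrees, percent of pixels under threshold).

    Hypothetical helper: both inputs are equal-length sequences of (x, y, z)
    normals, one entry per evaluated pixel.
    """
    if len(pred_normals) != len(gt_normals):
        raise ValueError("prediction and ground truth must have the same length")
    if not pred_normals:
        raise ValueError("no pixels to evaluate")
    errors = [angular_error_deg(p, g) for p, g in zip(pred_normals, gt_normals)]
    mean_error = sum(errors) / len(errors)
    percent_under = 100.0 * sum(e < threshold_deg for e in errors) / len(errors)
    return mean_error, percent_under


# Toy data (made up): three pixels whose errors are roughly 0, 45, and 90 degrees.
pred = [(0.0, 0.0, 1.0), (0.0, 1.0, 1.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
mean_err, pct_30 = normal_metrics(pred, gt)
print(f"mean angular error: {mean_err:.2f} deg, under 30 deg: {pct_30:.1f}%")
```

Lower mean angular error and a higher under-30° percentage both indicate better normals, which is the direction of the improvements reported above.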

Project

Project Page: https://normalcrafter.github.io/
Paper: https://arxiv.org/abs/2504.11427
GitHub: https://github.com/Binyr/NormalCrafter

Related articles from LearnOpenCV

DepthPro – Monocular Metric Depth Estimation: https://learnopencv.com/depth-pro-monocular-metric-depth/
Sapiens: Foundation for Human Vision Models: https://learnopencv.com/sapiens-human-vision-models/
Depth Estimation: https://learnopencv.com/author/kaustubh-sadekar/
Research Papers: https://opencv.org/blog/category/research-papers/

The post NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors appeared first on OpenCV.
